Edit clearml-data documentation

Erez Schnaider 2021-04-01 11:57:10 +03:00 committed by Allegro AI
parent e8752c54ff
commit ee58f8c3ce


@ -31,21 +31,17 @@ that is both machine and environment agnostic.
clearml-data create --project <my_project> --name <my_dataset_name>
```
- Add local files to the dataset
```bash
clearml-data add --files ~/datasets/best_dataset/
```
- Close dataset and upload files (Optional: specify storage `--storage` `s3://bucket`, `gs://`, `azure://` or `/mnt/shared/`)
```bash
clearml-data close --id <dataset_id>
```
#### Integrating datasets into your code:
```python
from argparse import ArgumentParser
from clearml import Dataset
@ -63,21 +59,44 @@ dataset_folder = Dataset.get(dataset_id=args.dataset).get_local_copy()
# go over the files in `dataset_folder` and train your model
```
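The snippet above is truncated by the diff; a fuller, self-contained version might look like the following. This is a minimal sketch: the `parse_args` and `get_dataset_folder` helper names are illustrative, and fetching the dataset assumes a reachable ClearML server.

```python
from argparse import ArgumentParser


def parse_args(argv=None):
    # --dataset receives the id printed by `clearml-data create`
    parser = ArgumentParser()
    parser.add_argument('--dataset', type=str, required=True,
                        help='Dataset ID to train on')
    return parser.parse_args(argv)


def get_dataset_folder(dataset_id):
    # Imported here so argument parsing stays usable without clearml installed
    from clearml import Dataset
    # Returns a cached, read-only local copy of the dataset files
    return Dataset.get(dataset_id=dataset_id).get_local_copy()


if __name__ == '__main__':
    args = parse_args()
    dataset_folder = get_dataset_folder(args.dataset)
    # go over the files in `dataset_folder` and train your model
```

Because the dataset id arrives as a plain argument, the same training script runs unchanged against any dataset version.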
#### Create dataset from code
Creating datasets from code is especially helpful when preprocessing is applied to raw data and you want to save
both the preprocessing code and the resulting dataset in a single Task.
```python
from clearml import Dataset
# Preprocessing code here
dataset = Dataset.create(dataset_name='dataset name', dataset_project='dataset project')
dataset.add_files('/path_to_data')
dataset.upload()
dataset.close()
```
#### Modifying a dataset with CLI:
- Create a new dataset (specify the parent dataset id)
```bash
clearml-data create --name <improved_dataset> --parents <existing_dataset_id>
```
- Get a mutable copy of the current dataset
```bash
clearml-data get --id <created_dataset_id> --copy ~/datasets/working_dataset
```
- Change / add / remove files from the dataset folder
```bash
vim ~/datasets/working_dataset/everything.csv
```
#### Folder sync mode
Folder sync mode updates a dataset according to changes in a folder's content.<br/>
This is useful when there is a single point of truth, either a local or network folder, that gets updated periodically.
When using `clearml-data sync` and specifying a parent dataset, the folder changes are reflected in a new dataset version.
This saves the time of manually adding and removing files.
- Sync local changes
```bash
clearml-data sync --id <created_dataset_id> --folder ~/datasets/working_dataset