From ee58f8c3cefe38109654a2eb58ba6deb08744517 Mon Sep 17 00:00:00 2001
From: Erez Schnaider
Date: Thu, 1 Apr 2021 11:57:10 +0300
Subject: [PATCH] Edit clearml-data documentation

---
 docs/datasets.md | 39 +++++++++++++++++++++++++++++----------
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/docs/datasets.md b/docs/datasets.md
index d192409a..25e8942d 100644
--- a/docs/datasets.md
+++ b/docs/datasets.md
@@ -31,21 +31,17 @@ that is both machine and environment agnostic.
 clearml-data create --project --name
 ```
 - Add local files to the dataset
-``` bashtrue
-clearml-data add --id --files ~/datasets/best_dataset/
-```
-- Upload files (Optional: specify storage `--storage` `s3://bucket`, `gs://`, `azure://` or `/mnt/shared/`)
 ``` bash
-clearml-data upload --id
+clearml-data add --files ~/datasets/best_dataset/
 ```
-- Close dataset
+- Close dataset and upload files (Optional: specify storage `--storage` `s3://bucket`, `gs://`, `azure://` or `/mnt/shared/`)
 ``` bash
 clearml-data close --id
 ```
 
 
 #### Integrating datasets into your code:
-``` python
+```python
 from argparse import ArgumentParser
 
 from clearml import Dataset
@@ -63,21 +59,44 @@ dataset_folder = Dataset.get(dataset_id=args.dataset).get_local_copy()
 # go over the files in `dataset_folder` and train your model
 ```
 
+#### Create dataset from code
+Creating datasets from code is especially helpful when some preprocessing is done on raw data and we want to save
+preprocessing code as well as dataset in a single Task.
+
+```python
+from clearml import Dataset
+
+# Preprocessing code here
+
+dataset = Dataset.create(dataset_name='dataset name',dataset_project='dataset project')
+dataset.add_files('/path_to_data')
+dataset.upload()
+dataset.close()
+
+```
 
 #### Modifying a dataset with CLI:
 
 - Create a new dataset (specify the parent dataset id)
-``` bash
+```bash
 clearml-data create --name --parents
 ```
 - Get a mutable copy of the current dataset
-``` bash
+```bash
 clearml-data get --id --copy ~/datasets/working_dataset
 ```
 - Change / add / remove files from the dataset folder
-``` bash
+```bash
 vim ~/datasets/working_dataset/everything.csv
 ```
+
+#### Folder sync mode
+
+Folder sync mode updates dataset according to folder content changes.
+This is useful in case there's a single point of truth, either a local or network folder that gets updated periodically.
+When using `clearml-data sync` and specifying parent dataset, the folder changes will be reflected in a new dataset version.
+This saves time manually updating (adding \ removing) files.
+
 - Sync local changes
 ``` bash
 clearml-data sync --id --folder ~/datasets/working_dataset