From 95020aa759aeab1115b5af20f6eeac60e89eaab5 Mon Sep 17 00:00:00 2001 From: pollfly <75068813+pollfly@users.noreply.github.com> Date: Wed, 1 Sep 2021 12:48:30 +0300 Subject: [PATCH] Rewrite ClearML Data docs (#56) --- docs/clearml_data.md | 269 +++++++++++++++++++++++-------------------- 1 file changed, 144 insertions(+), 125 deletions(-) diff --git a/docs/clearml_data.md b/docs/clearml_data.md index e2838475..6cc78ea8 100644 --- a/docs/clearml_data.md +++ b/docs/clearml_data.md @@ -12,24 +12,34 @@ ClearML Data Management solves two important challenges: **We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear. Moreover, it can be difficult and inefficient to find on a git tree the commit associated with a certain version of a dataset. -A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 \ GS \ Azure \ Network Storage). +A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage). Datasets can be set up to inherit from other datasets, so data lineages can be created, -and users can track when and how their data changes.
-Dataset changes are stored using differentiable storage, meaning a version will store the change-set from its previous dataset parents
+and users can track when and how their data changes.
+
+Dataset changes are stored using differentiable storage, meaning a version will store the change-set from its previous dataset parents.
 Local copies of datasets are always cached, so the same data never needs to be downloaded twice.
-When a dataset is pulled it will automatically pull all parent datasets and merge them into one output folder for you to work with
+When a dataset is pulled it will automatically pull all parent datasets and merge them into one output folder for you to work with.

-ClearML-data offers two interfaces:
+ClearML Data offers two interfaces:
 - `clearml-data` - CLI utility for creating, uploading, and managing datasets.
 - `clearml.Dataset` - A python interface for creating, retrieving, managing, and using datasets.

-## Creating a Dataset
+## Setup
+
+`clearml-data` comes built-in with our `clearml` python package! Just check out the [getting started](getting_started/ds/ds_first_steps.md) guide for more info!
+
+## Workflow
+Below is an example of a workflow that uses ClearML Data's command line tool to create a dataset and then integrates the dataset into code
+using ClearML Data's python interface.
+
+### Creating a Dataset

Using the `clearml-data` CLI, users can create datasets using the following commands:
```bash
clearml-data create --project dataset_example --name initial_version
clearml-data add --files data_folder
+clearml-data close
```

The commands will do the following:
@@ -40,13 +50,15 @@ The commands will do the following:
1. All the files from the "data_folder" folder will be added to the dataset and uploaded
by default to the [ClearML server](deploying_clearml/clearml_server.md).

+
+1. The dataset will be finalized, making it immutable and ready to be consumed.
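The same three steps can also be sketched with the python interface. This is a rough equivalent rather than a prescribed flow: it assumes a configured ClearML environment with a reachable server, and reuses the illustrative project and dataset names from the CLI example above.

```python
# Rough SDK equivalent of the CLI workflow above (illustrative names,
# assumes clearml is installed and clearml.conf points at a server)
from clearml import Dataset

dataset = Dataset.create(dataset_name="initial_version", dataset_project="dataset_example")
dataset.add_files(path="data_folder")  # stage local files for upload
dataset.upload()                       # push files to the default file server
dataset.finalize()                     # close the version, making it immutable
```

`Dataset.create`, `add_files`, `upload`, and `finalize` mirror the CLI's `create`, `add`, and `close` steps.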
:::note
`clearml-data` is stateful and remembers the last created dataset so there's no need to specify a specific dataset ID unless
we want to work on another dataset.
:::

-## Using a Dataset
+### Using a Dataset

Now in our python code, we can access and use the created dataset from anywhere:
```python
@@ -60,17 +72,11 @@
We have all our files in the same folder structure under `local_path`. It is that simple!

The next step is to set the dataset_id as a parameter for our code and voilà! We can now train on any dataset we have in the system.

-## Setup
-
-`clearml-data` comes built-in with our `clearml` python package! Just check out the [getting started](getting_started/ds/ds_first_steps.md) guide for more info!
+## CLI Options

-## Usage
-
-### CLI
-
-It's possible to manage datasets (create \ modify \ upload \ delete) with the `clearml-data` command line tool.
+It's possible to manage datasets (create / modify / upload / delete) with the `clearml-data` command line tool.

-#### Creating a Dataset
+### Creating a Dataset
```bash
clearml-data create --project <project_name> --name <dataset_name> --parents [<existing_dataset_id>]
```
@@ -80,10 +86,10 @@
Creates a new dataset.
|Name|Description|Optional| |---|---|---| -|name |Dataset's name| | -|project|Dataset's project| | -|parents|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| | -|tags |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| | +|name |Dataset's name| No | +|project|Dataset's project| No | +|parents|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| Yes | +|tags |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| Yes| :::important clearml-data works in a stateful mode so once a new dataset is created, the following commands @@ -92,7 +98,7 @@ do not require the `--id` flag.
-#### Add Files to Dataset
+### Add Files
```bash
clearml-data add --id <dataset_id> --files <files/folders_to_add>
```
It's possible to add individual files or complete folders.
|Name|Description|Optional| |---|---|---| -|id | Dataset's ID. Default: previously created / accessed dataset| | -|files|Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | | -|dataset-folder | Dataset base folder to add the files to in the dataset. Default: dataset root| | -|non-recursive | Disable recursive scan of files | | -|verbose | Verbose reporting | | +|id | Dataset's ID. Default: previously created / accessed dataset| Yes | +|files|Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | No | +|dataset-folder | Dataset base folder to add the files to in the dataset. Default: dataset root| Yes | +|non-recursive | Disable recursive scan of files | Yes | +|verbose | Verbose reporting | Yes|
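As an illustration of the flags above (the local paths are made up), a couple of typical invocations:

```bash
# add all JPGs matching a wildcard, plus a whole folder of JSON files
clearml-data add --files ~/data/*.jpg ~/data/json

# place a file under a specific folder inside the dataset instead of its root
clearml-data add --files ~/labels.csv --dataset-folder metadata
```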
-#### Remove Files From Dataset
+### Remove Files
```bash
clearml-data remove --id <dataset_id> --files <files/folders_to_remove>
```
@@ -119,33 +125,14 @@

|Name|Description|Optional|
|---|---|---|
-|id | Dataset's ID. Default: previously created / accessed dataset| |
-|files | Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: file path is the path within the dataset, not the local path.| |
-|non-recursive | Disable recursive scan of files | |
-|verbose | Verbose reporting | |
+|id | Dataset's ID. Default: previously created / accessed dataset| Yes |
+|files | Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: file path is the path within the dataset, not the local path.| No |
+|non-recursive | Disable recursive scan of files | Yes |
+|verbose | Verbose reporting | Yes|
-#### Finalize Dataset -```bash -clearml-data close --id -``` -Finalizes the dataset and makes it ready to be consumed. -It automatically uploads all files that were not previously uploaded. -Once a dataset is finalized, it can no longer be modified. - -**Parameters** - -|Name|Description|Optional| -|---|---|---| -|id| Dataset's ID. Default: previously created / accessed dataset| | -|storage| Remote storage to use for the dataset files. Default: files_server | | -|disable-upload | Disable automatic upload when closing the dataset | | -|verbose | Verbose reporting | | - -
-#### Upload Dataset' Content
+### Upload Dataset Content
```bash
clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]
```
@@ -157,13 +144,32 @@
medium by entering an upload destination, such as `s3://bucket`, `gs://`, `azure://bucket`

|Name|Description|Optional|
|---|---|---|
-|id| Dataset's ID. Default: previously created / accessed dataset| |
-|storage| Remote storage to use for the dataset files. Default: files_server | |
-|verbose | Verbose reporting | |
+|id| Dataset's ID. Default: previously created / accessed dataset| Yes |
+|storage| Remote storage to use for the dataset files. Default: files_server | Yes |
+|verbose | Verbose reporting | Yes|
-#### Sync Local Folder +### Finalize Dataset +```bash +clearml-data close --id +``` +Finalizes the dataset and makes it ready to be consumed. +It automatically uploads all files that were not previously uploaded. +Once a dataset is finalized, it can no longer be modified. + +**Parameters** + +|Name|Description|Optional| +|---|---|---| +|id| Dataset's ID. Default: previously created / accessed dataset| Yes | +|storage| Remote storage to use for the dataset files. Default: files_server | Yes | +|disable-upload | Disable automatic upload when closing the dataset | Yes | +|verbose | Verbose reporting | Yes| + +
### Sync Local Folder
```
clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']
```
@@ -180,19 +186,19 @@
This command also uploads the data and finalizes the dataset automatically.

**Parameters**

|Name|Description|Optional|
|---|---|---|
-|id| Dataset's ID. Default: previously created / accessed dataset| |
-|folder|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`||
-|storage|Remote storage to use for the dataset files. Default: files_server ||
-|parents|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset||
-|project|If creating a new dataset, specify the dataset's project name||
-|name|If creating a new dataset, specify the dataset's name||
-|tags|Dataset user tags||
-|skip-close|Do not auto close dataset after syncing folders||
-|verbose | Verbose reporting ||
+|id| Dataset's ID. Default: previously created / accessed dataset| Yes |
+|folder|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|No|
+|storage|Remote storage to use for the dataset files. Default: files_server |Yes|
+|parents|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|Yes|
+|project|If creating a new dataset, specify the dataset's project name|Yes|
+|name|If creating a new dataset, specify the dataset's name|Yes|
+|tags|Dataset user tags|Yes|
+|skip-close|Do not auto close dataset after syncing folders|Yes|
+|verbose | Verbose reporting |Yes|
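For example, a local folder that is refreshed periodically could be pushed as a brand new dataset version in a single command (the project, name, and folder values here are illustrative):

```bash
# create, upload, and finalize a dataset version from a local folder in one shot
clearml-data sync --project dataset_example --name nightly_version --folder ./data_folder
```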
-#### List Dataset Content
+### List Dataset Content
```bash
clearml-data list [--id <dataset_id>]
```
@@ -201,15 +207,15 @@

|Name|Description|Optional|
|---|---|---|
-|id|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset||
-|project|Specify dataset project name (if used instead of ID, dataset name is also required)||
-|name|Specify dataset name (if used instead of ID, dataset project is also required)||
-|filter|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`||
-|modified|Only list file changes (add / remove / modify) introduced in this version||
+|id|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|Yes|
+|project|Specify dataset project name (if used instead of ID, dataset name is also required)|Yes|
+|name|Specify dataset name (if used instead of ID, dataset project is also required)|Yes|
+|filter|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|Yes|
+|modified|Only list file changes (add / remove / modify) introduced in this version|Yes|
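To illustrate how a folder / wildcard filter narrows a listing, here is a rough sketch of the matching idea; `filter_dataset_files` is a hypothetical helper written for this example, not part of the `clearml` package:

```python
# Illustrative sketch of folder / wildcard filtering, in the spirit of --filter.
# filter_dataset_files is hypothetical and NOT a clearml API.
from fnmatch import fnmatch

def filter_dataset_files(paths, patterns):
    """Keep paths matching any wildcard pattern, or sitting under a plain folder pattern."""
    matched = []
    for path in paths:
        for pattern in patterns:
            if fnmatch(path, pattern) or path.startswith(pattern.rstrip("/") + "/"):
                matched.append(path)
                break  # a path is listed once even if several filters match
    return matched

files = [
    "folder/date_2021-01.json",
    "folder/readme.txt",
    "folder/sub-folder/img.jpg",
]
print(filter_dataset_files(files, ["folder/date_*.json", "folder/sub-folder"]))
# ['folder/date_2021-01.json', 'folder/sub-folder/img.jpg']
```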
-#### Delete a Dataset
+### Delete Dataset
```
clearml-data delete [--id <dataset_id>]
```
@@ -221,12 +227,12 @@
This does not work on datasets with children.

**Parameters**

|Name|Description|Optional|
|---|---|---|
-|id|ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet||
-|force|Force dataset deletion even if other dataset versions depend on it|||
+|id|ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet|Yes|
+|force|Force dataset deletion even if other dataset versions depend on it|Yes|
-#### Search for a Dataset
+### Search for a Dataset
```
clearml-data search [--name <name>] [--project <project_name>] [--tags <tag>]
```

Datasets can be searched by project, name, ID, and tags.
@@ -245,98 +251,111 @@
-### Python API +### Compare Two Datasets -All API commands should be imported with
-`from clearml import Dataset` +``` +clearml-data compare [--source SOURCE] [--target TARGET] +``` +Compare two datasets (target vs. source). The command returns a comparison summary that looks like this: -#### `Dataset.get(dataset_id=DS_ID).get_local_copy()` - -Returns a path to dataset in cache, and downloads it if it is not already in cache. +``` +Comparison summary: 4 files removed, 3 files modified, 0 files added +``` **Parameters** |Name|Description|Optional| |---|---|---| -|use_soft_links|If True, use soft links. Default: False on Windows, True on Posix systems|| -|raise_on_error|If True, raise exception if dataset merging failed on any file|| +|source|Source dataset id (used as baseline)|No| +|target|Target dataset id (compare against the source baseline dataset)|No| +|verbose|Verbose report all file changes (instead of summary)|Yes| -
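Since each version records per-file hashes, the summary can be thought of as a diff over two path-to-hash mappings. The following is an illustrative sketch of that idea only; `compare_versions` and the hash values are made up for this example, not ClearML internals:

```python
# Illustrative only: a version comparison reduced to diffing two
# {path: content_hash} mappings. compare_versions is hypothetical.
def compare_versions(source, target):
    removed = [p for p in source if p not in target]
    added = [p for p in target if p not in source]
    modified = [p for p in source if p in target and source[p] != target[p]]
    # same flat "N files ..." phrasing as the CLI summary shown above
    return (f"Comparison summary: {len(removed)} files removed, "
            f"{len(modified)} files modified, {len(added)} files added")

source = {"a.csv": "hash1", "b.csv": "hash2", "c.csv": "hash3"}
target = {"a.csv": "hash1", "b.csv": "hash9"}
print(compare_versions(source, target))
# Comparison summary: 1 files removed, 1 files modified, 0 files added
```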
+### Merge Datasets -#### `Dataset.get(dataset_id=DS_ID).get_mutable_local_copy()` +``` +clearml-data squash --name NAME --ids [IDS [IDS ...]] +``` -Downloads the dataset to a specific folder (non-cached). If the folder already has contents, specify whether to overwrite -its contents with the dataset contents. +Squash (merge) multiple datasets into a single dataset version. **Parameters** |Name|Description|Optional| |---|---|---| -|target_folder|Local target folder for the writable copy of the dataset|| -|overwrite|If True, recursively delete the contents of the target folder before creating a copy of the dataset. If False (default) and target folder contains files, raise exception or return None|| -|raise_on_error|If True, raise exception if dataset merging failed on any file|| +|name|Create squashed dataset name|No| +|ids|Source dataset IDs to squash (merge down)|No| +|storage|Remote storage to use for the dataset files. Default: files_server |Yes| +|verbose|Verbose report all file changes (instead of summary)|Yes| -
+### Verify Dataset -#### `Dataset.create()` +``` +clearml-data verify [--id ID] [--folder FOLDER] +``` -Create a new dataset. - -Parent datasets can be specified, and the new dataset inherits all of its parent's content. Multiple dataset parents can -be listed. Merging of parent datasets is done based on the list's order, where each parent can override overlapping files -in the previous parent dataset. +Verify that the dataset content matches the data from the local source. **Parameters** |Name|Description|Optional| |---|---|---| -|dataset_name|Name of the new dataset|| -|dataset_project|The project containing the dataset. If not specified, infer project name from parent datasets. If there is no parent dataset, then this value is required|| -|parent_datasets|Expand a parent dataset by adding / removing files|| -|use_current_task|If True, the dataset is created on the current Task. Default: False|| +|id|Specify dataset ID. Default: previously created/accessed dataset|Yes| +|folder|Specify dataset local copy (if not provided the local cache folder will be verified)|Yes| +|filesize| If True, only verify file size and skip hash checks (default: false)|Yes| +|verbose|Verbose report all file changes (instead of summary)|Yes| -
+### Get a Dataset -#### `Dataset.add_files()` +``` +clearml-data get [--id ID] [--copy COPY] [--link LINK] [--overwrite] +``` -Add files or folder into the current dataset. +Get a local copy of a dataset. By default, you get a read only cached folder, but you can get a mutable copy by using the +`--copy` flag. **Parameters** |Name|Description|Optional| |---|---|---| -|path|Add a folder / file to the dataset|| -|wildcard|Add only a specific set of files based on wildcard matching. Wildcard matching can be a single string or a list of wildcards, for example: `~/data/*.jpg`, `~/data/json`|| -|local_base_folder|Files will be located based on their relative path from local_base_folder|| -|dataset_path|Where in the dataset the folder / files should be located|| -|recursive|If True, match all wildcard files recursively|| -|verbose| If True, print to console files added / modified|| +|id| Specify dataset ID. Default: previously created / accessed dataset|Yes| +|copy| Get a writable copy of the dataset to a specific output folder|Yes| +|link| Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset|Yes| +|overwrite| If True, overwrite the target folder|Yes| +|verbose| Verbose report all file changes (instead of summary)|Yes| -
+### Publish a Dataset + +``` +clearml-data publish --id ID +``` + +Publish the dataset for public use. The dataset must be [finalized](#finalize-dataset) before it is published. -#### `Dataset.upload()` -Start file uploading, the function returns when all files are uploaded. **Parameters** |Name|Description|Optional| |---|---|---| -|show_progress|If True, show upload progress bar|| -|verbose|If True, print verbose progress report|| -|output_url|Target storage for the compressed dataset (default: file server). Examples: `s3://bucket/data`, `gs://bucket/data` , `azure://bucket/data`, `/mnt/share/data` || -|compression|Compression algorithm for the Zipped dataset file (default: ZIP_DEFLATED)|| +|id| The dataset task id to be published.|No| -
-#### `Dataset.finalize()` -Closes the dataset and marks it as *Completed*. After a dataset has been closed, it can no longer be modified. -Before closing a dataset, its files must first be uploaded. -**Parameters** +## Python API -|Name|Description|Optional| -|---|---|---| -|verbose|If True, print verbose progress report|| -|raise_on_error|If True, raise exception if dataset finalizing failed|| +It's also possible to manage a dataset using ClearML Data's python interface. +All API commands should be imported with: + +```python +from clearml import Dataset +``` + +See all API commands in the [Dataset](references/sdk/dataset.md) reference page. + +## Tutorials + +Take a look at the ClearML Data tutorials: +* [Dataset Management with CLI and SDK](guides/data%20management/data_man_cifar_classification) +* [Dataset Management with CLI](guides/data%20management/data_man_simple) +* [Folder Sync with CLI](guides/data%20management/data_man_folder_sync)