Rewrite ClearML Data docs (#56)

pollfly 2021-09-01 12:48:30 +03:00 committed by GitHub
parent c52697fd1b
commit 95020aa759


@ -12,24 +12,34 @@ ClearML Data Management solves two important challenges:
**We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear. **We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear.
Moreover, it can be difficult and inefficient to find on a git tree the commit associated with a certain version of a dataset. Moreover, it can be difficult and inefficient to find on a git tree the commit associated with a certain version of a dataset.
A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 \ GS \ Azure \ Network Storage). A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage).
Datasets can be set up to inherit from other datasets, so data lineages can be created, Datasets can be set up to inherit from other datasets, so data lineages can be created,
and users can track when and how their data changes.<br/> and users can track when and how their data changes.
Dataset changes are stored using differentiable storage, meaning a version will store the change-set from its previous dataset parents
Dataset changes are stored using differentiable storage, meaning a version will store the change-set from its previous dataset parents.
Local copies of datasets are always cached, so the same data never needs to be downloaded twice. Local copies of datasets are always cached, so the same data never needs to be downloaded twice.
When a dataset is pulled it will automatically pull all parent datasets and merge them into one output folder for you to work with When a dataset is pulled it will automatically pull all parent datasets and merge them into one output folder for you to work with.
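As a rough illustration of the change-set idea (a toy model, not ClearML's actual storage format), each version below records only the files it adds, modifies, or removes relative to its parents. The full listing is rebuilt by merging the parents in order and then applying the version's own changes. The `Version` class and `resolve` function are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    """Toy dataset version: stores only its change-set, not a full copy."""
    parents: list = field(default_factory=list)  # parent Version objects, in merge order
    changed: dict = field(default_factory=dict)  # path -> content hash (added or modified)
    removed: set = field(default_factory=set)    # paths deleted in this version


def resolve(version):
    """Rebuild the full {path: hash} listing for a version."""
    files = {}
    for parent in version.parents:
        # Later parents override overlapping files from earlier parents.
        files.update(resolve(parent))
    for path in version.removed:
        files.pop(path, None)
    files.update(version.changed)
    return files


base = Version(changed={"a.csv": "h1", "b.csv": "h2"})
child = Version(parents=[base], changed={"b.csv": "h3", "c.csv": "h4"}, removed={"a.csv"})
print(resolve(child))  # {'b.csv': 'h3', 'c.csv': 'h4'}
```

Only `child`'s three changes are stored for the new version; everything else is recovered from `base` when the listing is resolved.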
ClearML-data offers two interfaces: ClearML Data offers two interfaces:
- `clearml-data` - CLI utility for creating, uploading, and managing datasets. - `clearml-data` - CLI utility for creating, uploading, and managing datasets.
- `clearml.Dataset` - A python interface for creating, retrieving, managing, and using datasets. - `clearml.Dataset` - A python interface for creating, retrieving, managing, and using datasets.
## Creating a Dataset ## Setup
`clearml-data` comes built-in with our `clearml` python package! Just check out the [getting started](getting_started/ds/ds_first_steps.md) guide for more info!
## Workflow
Below is an example of a workflow that creates a dataset using ClearML Data's command line tool and integrates the dataset into code
using ClearML Data's python interface.
### Creating a Dataset
Using the `clearml-data` CLI, users can create datasets using the following commands: Using the `clearml-data` CLI, users can create datasets using the following commands:
```bash ```bash
clearml-data create --project dataset_example --name initial_version clearml-data create --project dataset_example --name initial_version
clearml-data add --files data_folder clearml-data add --files data_folder
clearml-data close
``` ```
The commands will do the following: The commands will do the following:
@ -40,13 +50,15 @@ The commands will do the following:
1. All the files from the "data_folder" folder will be added to the dataset and uploaded 1. All the files from the "data_folder" folder will be added to the dataset and uploaded
by default to the [ClearML server](deploying_clearml/clearml_server.md). by default to the [ClearML server](deploying_clearml/clearml_server.md).
1. The dataset will be finalized, making it immutable and ready to be consumed.
:::note :::note
`clearml-data` is stateful and remembers the last created dataset so there's no need to specify a specific dataset ID unless `clearml-data` is stateful and remembers the last created dataset so there's no need to specify a specific dataset ID unless
we want to work on another dataset. we want to work on another dataset.
::: :::
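If the same workflow needs to be scripted, one possible approach is to drive the CLI from Python with `subprocess`. This is a minimal sketch: the helper names `workflow_commands` and `run_workflow` are made up for illustration, and it relies on the stateful behavior described in the note, since no `--id` is passed to `add` or `close`.

```python
import shlex
import subprocess


def workflow_commands(project, name, folder):
    """Return the three CLI invocations of the workflow as argv lists."""
    return [
        ["clearml-data", "create", "--project", project, "--name", name],
        ["clearml-data", "add", "--files", folder],
        ["clearml-data", "close"],
    ]


def run_workflow(project, name, folder, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for argv in workflow_commands(project, name, folder):
        print(shlex.join(argv))
        if not dry_run:
            # Requires the clearml package to be installed and configured.
            subprocess.run(argv, check=True)


# Dry run: only prints the commands that would be executed.
run_workflow("dataset_example", "initial_version", "data_folder")
```

Passing `dry_run=False` executes the commands; `check=True` stops the sequence at the first command that fails.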
## Using a Dataset ### Using a Dataset
Now in our python code, we can access and use the created dataset from anywhere: Now in our python code, we can access and use the created dataset from anywhere:
```python ```python
@ -60,17 +72,11 @@ We have all our files in the same folder structure under `local_path`, it is tha
The next step is to set the dataset_id as a parameter for our code and voilà! We can now train on any dataset we have in The next step is to set the dataset_id as a parameter for our code and voilà! We can now train on any dataset we have in
the system. the system.
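One way to expose the dataset ID as a parameter is a command line argument. The sketch below assumes a hypothetical `--dataset-id` flag and a hypothetical `fetch_dataset` helper; the only ClearML call it uses is `Dataset.get(dataset_id=...).get_local_copy()`.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Train on a ClearML dataset")
    # --dataset-id is a hypothetical flag name chosen for this sketch.
    parser.add_argument("--dataset-id", required=True, help="ID of the dataset to train on")
    return parser.parse_args(argv)


def fetch_dataset(dataset_id):
    """Return the path of a cached local copy of the dataset."""
    # Imported lazily so the argument parsing can be tried without clearml installed.
    from clearml import Dataset

    return Dataset.get(dataset_id=dataset_id).get_local_copy()


# Parsing demo with a placeholder ID; a real script would call parse_args()
# with no arguments and pass args.dataset_id to fetch_dataset().
args = parse_args(["--dataset-id", "<dataset_id>"])
print(args.dataset_id)
```

Because the training code only sees the local path returned by `fetch_dataset`, switching datasets is a matter of changing one argument.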
## Setup ## CLI Options
`clearml-data` comes built-in with our `clearml` python package! Just check out the [getting started](getting_started/ds/ds_first_steps.md) guide for more info! It's possible to manage datasets (create / modify / upload / delete) with the `clearml-data` command line tool.
## Usage ### Creating a Dataset
### CLI
It's possible to manage datasets (create \ modify \ upload \ delete) with the `clearml-data` command line tool.
#### Creating a Dataset
```bash ```bash
clearml-data create --project <project_name> --name <dataset_name> --parents <existing_dataset_id> clearml-data create --project <project_name> --name <dataset_name> --parents <existing_dataset_id>
``` ```
@ -80,10 +86,10 @@ Creates a new dataset. <br/>
|Name|Description|Optional| |Name|Description|Optional|
|---|---|---| |---|---|---|
|name |Dataset's name| <img src="/docs/latest/icons/ico-optional-no.svg" className="icon size-md center-md" /> | |name |Dataset's name| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|project|Dataset's project| <img src="/docs/latest/icons/ico-optional-no.svg" className="icon size-md center-md" /> | |project|Dataset's project| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|parents|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" /> | |parents|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|tags |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |tags |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
:::important :::important
clearml-data works in a stateful mode so once a new dataset is created, the following commands clearml-data works in a stateful mode so once a new dataset is created, the following commands
@ -92,7 +98,7 @@ do not require the `--id` flag.
<br/> <br/>
#### Add Files to Dataset ### Add Files
```bash ```bash
clearml-data add --id <dataset_id> --files <filenames/folders_to_add> clearml-data add --id <dataset_id> --files <filenames/folders_to_add>
``` ```
@ -102,15 +108,15 @@ It's possible to add individual files or complete folders.<br/>
|Name|Description|Optional| |Name|Description|Optional|
|---|---|---| |---|---|---|
|id | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" /> | |id | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|files|Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | <img src="/docs/latest/icons/ico-optional-no.svg" className="icon size-md center-md" /> | |files|Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|dataset-folder | Dataset base folder to add the files to in the dataset. Default: dataset root| <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" /> | |dataset-folder | Dataset base folder to add the files to in the dataset. Default: dataset root| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|non-recursive | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" /> | |non-recursive | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/> <br/>
#### Remove Files From Dataset ### Remove Files
```bash ```bash
clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove> clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>
``` ```
@ -119,33 +125,14 @@ clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_
|Name|Description|Optional| |Name|Description|Optional|
|---|---|---| |---|---|---|
|id | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" /> | |id | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|files | Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: file path is the path within the dataset, not the local path.| <img src="/docs/latest/icons/ico-optional-no.svg" className="icon size-md center-md" /> | |files | Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: file path is the path within the dataset, not the local path.| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|non-recursive | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" /> | |non-recursive | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/> <br/>
#### Finalize Dataset ### Upload Dataset Content
```bash
clearml-data close --id <dataset_id>
```
Finalizes the dataset and makes it ready to be consumed.
It automatically uploads all files that were not previously uploaded.
Once a dataset is finalized, it can no longer be modified.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" /> |
|storage| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" /> |
|disable-upload | Disable automatic upload when closing the dataset | <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" /> |
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
<br/>
#### Upload Dataset' Content
```bash ```bash
clearml-data upload [--id <dataset_id>] [--storage <upload_destination>] clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]
``` ```
@ -157,13 +144,32 @@ medium by entering an upload destination, such as `s3://bucket`, `gs://`, `azure
|Name|Description|Optional| |Name|Description|Optional|
|---|---|---| |---|---|---|
|id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" /> | |id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|storage| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" /> | |storage| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/> <br/>
#### Sync Local Folder ### Finalize Dataset
```bash
clearml-data close --id <dataset_id>
```
Finalizes the dataset and makes it ready to be consumed.
It automatically uploads all files that were not previously uploaded.
Once a dataset is finalized, it can no longer be modified.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|storage| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|disable-upload | Disable automatic upload when closing the dataset | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
### Sync Local Folder
``` ```
clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>'] clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']
``` ```
@ -180,19 +186,19 @@ This command also uploads the data and finalizes the dataset automatically.
|Name|Description|Optional| |Name|Description|Optional|
|---|---|---| |---|---|---|
|id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" /> | |id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|folder|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|<img src="/docs/latest/icons/ico-optional-no.svg" className="icon size-md center-md" />| |folder|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|storage|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |storage|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|parents|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |parents|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|project|If creating a new dataset, specify the dataset's project name|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |project|If creating a new dataset, specify the dataset's project name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|name|If creating a new dataset, specify the dataset's name|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |name|If creating a new dataset, specify the dataset's name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|tags|Dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |tags|Dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|skip-close|Do not auto close dataset after syncing folders|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |skip-close|Do not auto close dataset after syncing folders|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|verbose | Verbose reporting |<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |verbose | Verbose reporting |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/> <br/>
#### List Dataset Content ### List Dataset Content
```bash ```bash
clearml-data list [--id <dataset_id>] clearml-data list [--id <dataset_id>]
``` ```
@ -201,15 +207,15 @@ clearml-data list [--id <dataset_id>]
|Name|Description|Optional| |Name|Description|Optional|
|---|---|---| |---|---|---|
|id|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |id|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|project|Specify dataset project name (if used instead of ID, dataset name is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |project|Specify dataset project name (if used instead of ID, dataset name is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|name|Specify dataset name (if used instead of ID, dataset project is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |name|Specify dataset name (if used instead of ID, dataset project is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|filter|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |filter|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|modified|Only list file changes (add / remove / modify) introduced in this version|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |modified|Only list file changes (add / remove / modify) introduced in this version|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/> <br/>
#### Delete a Dataset ### Delete Dataset
``` ```
clearml-data delete [--id <dataset_id_to_delete>] clearml-data delete [--id <dataset_id_to_delete>]
``` ```
@ -221,12 +227,12 @@ This does not work on datasets with children.
|Name|Description|Optional| |Name|Description|Optional|
|---|---|---| |---|---|---|
|id|ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |id|ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|force|Force dataset deletion even if other dataset versions depend on it|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |force|Force dataset deletion even if other dataset versions depend on it|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/> <br/>
#### Search for a Dataset ### Search for a Dataset
``` ```
clearml-data search [--name <name>] [--project <project_name>] [--tags <tag>] clearml-data search [--name <name>] [--project <project_name>] [--tags <tag>]
``` ```
@ -245,98 +251,111 @@ Datasets can be searched by project, name, ID, and tags.
<br/> <br/>
### Python API ### Compare Two Datasets
All API commands should be imported with<br/> ```
`from clearml import Dataset` clearml-data compare [--source SOURCE] [--target TARGET]
```
Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:
#### `Dataset.get(dataset_id=DS_ID).get_local_copy()` ```
Comparison summary: 4 files removed, 3 files modified, 0 files added
Returns a path to dataset in cache, and downloads it if it is not already in cache. ```
**Parameters** **Parameters**
|Name|Description|Optional| |Name|Description|Optional|
|---|---|---| |---|---|---|
|use_soft_links|If True, use soft links. Default: False on Windows, True on Posix systems|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |source|Source dataset id (used as baseline)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|raise_on_error|If True, raise exception if dataset merging failed on any file|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |target|Target dataset id (compare against the source baseline dataset)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|verbose|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
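To make the three counts in the comparison summary concrete, here is a toy comparison of two file listings with the source as the baseline. It assumes each dataset can be reduced to a `{path: hash}` mapping and that a file counts as modified when its hash differs; how `clearml-data compare` detects changes internally is not specified here, and `compare` is a hypothetical function.

```python
def compare(source, target):
    """Compare two {path: hash} listings, using source as the baseline."""
    removed = sorted(set(source) - set(target))
    added = sorted(set(target) - set(source))
    modified = sorted(p for p in set(source) & set(target) if source[p] != target[p])
    return removed, modified, added


source = {"a.jpg": "h1", "b.jpg": "h2", "c.jpg": "h3"}
target = {"b.jpg": "h2", "c.jpg": "h9", "d.jpg": "h4"}
removed, modified, added = compare(source, target)
print(f"Comparison summary: {len(removed)} files removed, "
      f"{len(modified)} files modified, {len(added)} files added")
```

Here `a.jpg` exists only in the source (removed), `c.jpg` has a different hash (modified), and `d.jpg` exists only in the target (added).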
<br/> ### Merge Datasets
#### `Dataset.get(dataset_id=DS_ID).get_mutable_local_copy()` ```
clearml-data squash --name NAME --ids [IDS [IDS ...]]
```
Downloads the dataset to a specific folder (non-cached). If the folder already has contents, specify whether to overwrite Squash (merge) multiple datasets into a single dataset version.
its contents with the dataset contents.
**Parameters** **Parameters**
|Name|Description|Optional| |Name|Description|Optional|
|---|---|---| |---|---|---|
|target_folder|Local target folder for the writable copy of the dataset|<img src="/docs/latest/icons/ico-optional-no.svg" className="icon size-md center-md" />| |name|Create squashed dataset name|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|overwrite|If True, recursively delete the contents of the target folder before creating a copy of the dataset. If False (default) and target folder contains files, raise exception or return None|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |ids|Source dataset IDs to squash (merge down)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|raise_on_error|If True, raise exception if dataset merging failed on any file|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |storage|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|verbose|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/> ### Verify Dataset
#### `Dataset.create()` ```
clearml-data verify [--id ID] [--folder FOLDER]
```
Create a new dataset. Verify that the dataset content matches the data from the local source.
Parent datasets can be specified, and the new dataset inherits all of its parent's content. Multiple dataset parents can
be listed. Merging of parent datasets is done based on the list's order, where each parent can override overlapping files
in the previous parent dataset.
**Parameters** **Parameters**
|Name|Description|Optional| |Name|Description|Optional|
|---|---|---| |---|---|---|
|dataset_name|Name of the new dataset|<img src="/docs/latest/icons/ico-optional-no.svg" className="icon size-md center-md" />| |id|Specify dataset ID. Default: previously created/accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|dataset_project|The project containing the dataset. If not specified, infer project name from parent datasets. If there is no parent dataset, then this value is required|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |folder|Specify dataset local copy (if not provided the local cache folder will be verified)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|parent_datasets|Expand a parent dataset by adding / removing files|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |filesize| If True, only verify file size and skip hash checks (default: false)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|use_current_task|If True, the dataset is created on the current Task. Default: False|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |verbose|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
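The difference between a full check and a size-only check (the `filesize` option) can be sketched as follows. This is an illustration only: the manifest layout, the use of SHA-256, and the `verify_folder` function are assumptions made for the example, not a description of how `clearml-data verify` is implemented.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_folder(folder, manifest, filesize_only=False):
    """Return the relative paths in `manifest` that are missing or differ in `folder`.

    `manifest` maps relative path -> (size_in_bytes, sha256_hex).
    """
    mismatches = []
    for rel_path, (size, digest) in manifest.items():
        path = Path(folder) / rel_path
        if not path.is_file() or path.stat().st_size != size:
            mismatches.append(rel_path)
        elif not filesize_only and hashlib.sha256(path.read_bytes()).hexdigest() != digest:
            mismatches.append(rel_path)
    return mismatches


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_bytes(b"hello")
    good = {"a.txt": (5, hashlib.sha256(b"hello").hexdigest())}
    same_size = {"a.txt": (5, hashlib.sha256(b"world").hexdigest())}
    print(verify_folder(tmp, good))                           # []
    print(verify_folder(tmp, same_size))                      # ['a.txt']
    print(verify_folder(tmp, same_size, filesize_only=True))  # []
```

The last call shows the trade-off: a size-only check is faster but cannot detect content that changed without changing the file size.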
<br/> ### Get a Dataset
#### `Dataset.add_files()` ```
clearml-data get [--id ID] [--copy COPY] [--link LINK] [--overwrite]
```
Add files or folder into the current dataset. Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the
`--copy` flag.
**Parameters** **Parameters**
|Name|Description|Optional| |Name|Description|Optional|
|---|---|---| |---|---|---|
|path|Add a folder / file to the dataset|<img src="/docs/latest/icons/ico-optional-no.svg" className="icon size-md center-md" />| |id| Specify dataset ID. Default: previously created / accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|wildcard|Add only a specific set of files based on wildcard matching. Wildcard matching can be a single string or a list of wildcards, for example: `~/data/*.jpg`, `~/data/json`|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |copy| Get a writable copy of the dataset to a specific output folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|local_base_folder|Files will be located based on their relative path from local_base_folder|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |link| Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|dataset_path|Where in the dataset the folder / files should be located|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |overwrite| If True, overwrite the target folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|recursive|If True, match all wildcard files recursively|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |verbose| Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|verbose| If True, print to console files added / modified|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
<br/> ### Publish a Dataset
```
clearml-data publish --id ID
```
Publish the dataset for public use. The dataset must be [finalized](#finalize-dataset) before it is published.
#### `Dataset.upload()`
Start file uploading, the function returns when all files are uploaded.
**Parameters** **Parameters**
|Name|Description|Optional| |Name|Description|Optional|
|---|---|---| |---|---|---|
|show_progress|If True, show upload progress bar|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |id| The dataset task id to be published.|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|verbose|If True, print verbose progress report|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
|output_url|Target storage for the compressed dataset (default: file server). Examples: `s3://bucket/data`, `gs://bucket/data` , `azure://bucket/data`, `/mnt/share/data` |<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
|compression|Compression algorithm for the Zipped dataset file (default: ZIP_DEFLATED)|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
<br/>
#### `Dataset.finalize()`
Closes the dataset and marks it as *Completed*. After a dataset has been closed, it can no longer be modified.
Before closing a dataset, its files must first be uploaded.
**Parameters** ## Python API
|Name|Description|Optional| It's also possible to manage a dataset using ClearML Data's python interface.
|---|---|---|
|verbose|If True, print verbose progress report|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
|raise_on_error|If True, raise exception if dataset finalizing failed|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
All API commands should be imported with:
```python
from clearml import Dataset
```
See all API commands in the [Dataset](references/sdk/dataset.md) reference page.
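A minimal sketch of a create, add, upload, finalize sequence with the Python interface might look like the following. It uses `Dataset.create()`, `add_files()`, `upload()`, and `finalize()`; the wrapper function `create_dataset_version` and its arguments are hypothetical, and the reference page should be checked for the current signatures.

```python
def create_dataset_version(project, name, folder, parent_ids=None):
    """Create, populate, upload, and finalize a dataset; return the Dataset object."""
    # Imported inside the function so this file can be loaded without clearml installed.
    from clearml import Dataset

    dataset = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=parent_ids,  # optional list of parent dataset IDs
    )
    dataset.add_files(path=folder)  # add a folder (or a single file) to the dataset
    dataset.upload()                # returns once all files are uploaded
    dataset.finalize()              # files must be uploaded before finalizing
    return dataset
```

Calling `create_dataset_version("dataset_example", "initial_version", "data_folder")` would mirror the CLI workflow shown earlier, assuming a configured ClearML environment.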
## Tutorials
Take a look at the ClearML Data tutorials:
* [Dataset Management with CLI and SDK](guides/data%20management/data_man_cifar_classification)
* [Dataset Management with CLI](guides/data%20management/data_man_simple)
* [Folder Sync with CLI](guides/data%20management/data_man_folder_sync)