---
title: CLI
---

:::important
This page covers `clearml-data`, ClearML's file-based data management solution. See [Hyper-Datasets](../hyperdatasets/overview.md) for ClearML's advanced queryable dataset management solution.
:::

The `clearml-data` utility is a CLI tool for controlling and managing your data with ClearML.

The following page provides a reference for `clearml-data`'s CLI commands.

## create

Creates a new dataset.

```bash
clearml-data create [-h] [--parents [PARENTS [PARENTS ...]]] [--project PROJECT]
                    --name NAME [--tags [TAGS [TAGS ...]]]
```

**Parameters**
|Name|Description|Optional|
|---|---|---|
|`--name`|Dataset's name|No|
|`--project`|Dataset's project|No|
|`--parents`|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered|Yes|
|`--tags`|Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets|Yes|
:::tip Dataset ID
* To locate a dataset's ID, go to the dataset task's info panel in the [WebApp](../webapp/webapp_exp_track_visual.md). At the top of the panel, to the right of the dataset task name, click `ID`, and the dataset ID appears.
* `clearml-data` works in a stateful mode, so once a new dataset is created, subsequent commands do not require the `--id` flag.
:::
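For example, a new dataset could be created like this (the project name, dataset name, and tag are illustrative):

```bash
# Create a new dataset in a hypothetical "Example" project
clearml-data create --project Example --name "Sample Dataset" --tags images
```

The command prints the new dataset's ID, which subsequent stateful commands then use implicitly.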
## add

Add individual files or complete folders to the dataset.

```bash
clearml-data add [-h] [--id ID] [--dataset-folder DATASET_FOLDER]
                 [--files [FILES [FILES ...]]] [--links [LINKS [LINKS ...]]]
                 [--non-recursive] [--verbose]
```

**Parameters**
|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset's ID. Default: previously created / accessed dataset|Yes|
|`--files`|Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`. Items are uploaded to the dataset's designated storage|Yes|
|`--links`|Links to files / folders to add. Supports S3, GS, and Azure links, for example: `s3://bucket/data` `azure://bucket/folder`. Items remain in their original location|Yes|
|`--dataset-folder`|Dataset base folder to add the files to. Default: dataset root|Yes|
|`--non-recursive`|Disable recursive scan of files|Yes|
|`--verbose`|Verbose reporting|Yes|
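As a sketch, assuming local JPEG files under `~/data` and a hypothetical bucket named `my-bucket`:

```bash
# Add local files to the current dataset (they will be uploaded)
clearml-data add --files ~/data/*.jpg

# Add a remote folder as a link (data stays in its original location)
clearml-data add --links s3://my-bucket/raw-data
```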

## remove

Remove files/links from the dataset.

```bash
clearml-data remove [-h] [--id ID] [--files [FILES [FILES ...]]]
                    [--non-recursive] [--verbose]
```

**Parameters**
|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset's ID. Default: previously created / accessed dataset|Yes|
|`--files`|Files / folders to remove. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`. Note that the file path is the path within the dataset, not the local path. For links, specify their URL (e.g. `s3://bucket/data`)|No|
|`--non-recursive`|Disable recursive scan of files|Yes|
|`--verbose`|Verbose reporting|Yes|
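For instance, to remove files by their path within the dataset (the path here is hypothetical):

```bash
# Remove all JPEGs under the dataset's "data" folder
clearml-data remove --files data/*.jpg
```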

## upload

Upload the local dataset changes to the server. By default, the dataset is uploaded to the [ClearML Server](../deploying_clearml/clearml_server.md). You can specify a different storage medium by entering an upload destination, such as `s3://bucket`, `gs://`, `azure://`, or `/mnt/shared/`.

```bash
clearml-data upload [-h] [--id ID] [--storage STORAGE] [--chunk-size CHUNK_SIZE]
                    [--verbose]
```

**Parameters**
|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset's ID. Default: previously created / accessed dataset|Yes|
|`--storage`|Remote storage to use for the dataset files. Default: files_server|Yes|
|`--chunk-size`|Dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). For example, with 512, the dataset is split and uploaded in 512 MB chunks|Yes|
|`--verbose`|Verbose reporting|Yes|
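A hypothetical upload to S3 instead of the default files server might look like:

```bash
# Upload pending changes in 256 MB chunks to an assumed bucket
clearml-data upload --storage s3://my-bucket/datasets --chunk-size 256
```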

## close

Finalize the dataset, making it ready to be consumed. This automatically uploads all files that were not previously uploaded. Once a dataset is finalized, it can no longer be modified.

```bash
clearml-data close [-h] [--id ID] [--storage STORAGE] [--disable-upload]
                   [--chunk-size CHUNK_SIZE] [--verbose]
```

**Parameters**
|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset's ID. Default: previously created / accessed dataset|Yes|
|`--storage`|Remote storage to use for the dataset files. Default: files_server|Yes|
|`--disable-upload`|Disable automatic upload when closing the dataset|Yes|
|`--chunk-size`|Dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). For example, with 512, the dataset is split and uploaded in 512 MB chunks|Yes|
|`--verbose`|Verbose reporting|Yes|
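Since `clearml-data` is stateful, closing the current dataset is a single command; for example:

```bash
# Finalize the current dataset (uploads any pending files first)
clearml-data close --verbose
```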

## sync

Sync a folder's content with ClearML. This option is useful when a user has a single point of truth (i.e. a folder) that updates from time to time. When an update should be reflected in ClearML, call `clearml-data sync` with the folder path, and the changes (file additions, modifications, and removals) will be reflected in ClearML. This command also uploads the data and finalizes the dataset automatically.

```bash
clearml-data sync [-h] [--id ID] [--dataset-folder DATASET_FOLDER] --folder FOLDER
                  [--parents [PARENTS [PARENTS ...]]] [--project PROJECT]
                  [--name NAME] [--tags [TAGS [TAGS ...]]] [--storage STORAGE]
                  [--skip-close] [--chunk-size CHUNK_SIZE] [--verbose]
```

**Parameters**
|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset's ID. Default: previously created / accessed dataset|Yes|
|`--dataset-folder`|Dataset base folder to add the files to. Default: dataset root|Yes|
|`--folder`|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|No|
|`--storage`|Remote storage to use for the dataset files. Default: files_server|Yes|
|`--parents`|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|Yes|
|`--project`|If creating a new dataset, specify the dataset's project name|Yes|
|`--name`|If creating a new dataset, specify the dataset's name|Yes|
|`--tags`|Dataset user tags|Yes|
|`--skip-close`|Do not auto close the dataset after syncing folders|Yes|
|`--chunk-size`|Dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). For example, with 512, the dataset is split and uploaded in 512 MB chunks|Yes|
|`--verbose`|Verbose reporting|Yes|
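As an illustration, assuming a local folder `~/data` serves as the single point of truth (project and dataset names are hypothetical):

```bash
# Create (or update) a dataset from the folder, upload, and finalize it
clearml-data sync --project Example --name "Sample Dataset" --folder ~/data
```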

## list

List a dataset's contents.

```bash
clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME]
                  [--filter [FILTER [FILTER ...]]] [--modified]
```

**Parameters**
|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|Yes|
|`--project`|Specify dataset project name (if used instead of ID, dataset name is also required)|Yes|
|`--name`|Specify dataset name (if used instead of ID, dataset project is also required)|Yes|
|`--filter`|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|Yes|
|`--modified`|Only list file changes (add / remove / modify) introduced in this version|Yes|
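For example, to list only the JSON files changed in the current version (the filter pattern is illustrative):

```bash
# Show added / removed / modified JSON files in the current dataset
clearml-data list --filter "*.json" --modified
```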

## delete

Delete an entire dataset from ClearML. This can also be used to delete a newly created dataset. This does not work on datasets with children.

```bash
clearml-data delete [-h] [--id ID] [--force]
```

**Parameters**
|Name|Description|Optional|
|---|---|---|
|`--id`|ID of the dataset to delete. Default: previously created / accessed dataset that hasn't been finalized yet|Yes|
|`--force`|Force dataset deletion even if other dataset versions depend on it|Yes|
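For instance (the dataset ID is a placeholder):

```bash
# Force-delete a dataset even if other versions depend on it
clearml-data delete --id <dataset_id> --force
```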

## search

Search datasets in the system by project, name, ID, and/or tags. Returns a list of all datasets in the system that match the search request, sorted by creation time.

```bash
clearml-data search [-h] [--ids [IDS [IDS ...]]] [--project PROJECT]
                    [--name NAME] [--tags [TAGS [TAGS ...]]]
```

**Parameters**
|Name|Description|Optional|
|---|---|---|
|`--ids`|A list of dataset IDs|Yes|
|`--project`|The project name of the datasets|Yes|
|`--name`|A dataset name or a partial name to filter datasets by|Yes|
|`--tags`|A list of dataset user tags|Yes|
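A hypothetical search combining a project and a partial name:

```bash
# List datasets in the "Example" project whose name contains "Sample"
clearml-data search --project Example --name Sample
```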

## compare

Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:

`Comparison summary: 4 files removed, 3 files modified, 0 files added`

```bash
clearml-data compare [-h] --source SOURCE --target TARGET [--verbose]
```

**Parameters**
|Name|Description|Optional|
|---|---|---|
|`--source`|Source dataset ID (used as baseline)|No|
|`--target`|Target dataset ID (compared against the source baseline dataset)|No|
|`--verbose`|Verbose report all file changes (instead of summary)|Yes|
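For example, with two placeholder dataset IDs:

```bash
# Report every file change between the two versions, not just the summary
clearml-data compare --source <source_id> --target <target_id> --verbose
```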
## squash

Squash multiple datasets into a single dataset version (merge down).

```bash
clearml-data squash [-h] --name NAME --ids [IDS [IDS ...]] [--storage STORAGE]
                    [--verbose]
```

**Parameters**
|Name|Description|Optional|
|---|---|---|
|`--name`|Name of the new squashed dataset|No|
|`--ids`|Source dataset IDs to squash (merge down)|No|
|`--storage`|Remote storage to use for the dataset files. Default: files_server|Yes|
|`--verbose`|Verbose report all file changes (instead of summary)|Yes|
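As a sketch, with placeholder source IDs and an illustrative name:

```bash
# Merge two dataset versions into a single new dataset
clearml-data squash --name "Squashed Dataset" --ids <id_1> <id_2>
```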
## verify

Verify that the dataset content matches the data from the local source.

```bash
clearml-data verify [-h] [--id ID] [--folder FOLDER] [--filesize] [--verbose]
```

**Parameters**
|Name|Description|Optional|
|---|---|---|
|`--id`|Specify dataset ID. Default: previously created / accessed dataset|Yes|
|`--folder`|Specify dataset local copy (if not provided, the local cache folder will be verified)|Yes|
|`--filesize`|If `True`, only verify file size and skip hash checks (default: `False`)|Yes|
|`--verbose`|Verbose report all file changes (instead of summary)|Yes|
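For example, to verify a local copy against the current dataset (the folder path is illustrative):

```bash
# Check files in ~/datasets/sample against the dataset's records
clearml-data verify --folder ~/datasets/sample
```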
## get

Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the `--copy` flag.

```bash
clearml-data get [-h] [--id ID] [--copy COPY] [--link LINK] [--part PART]
                 [--num-parts NUM_PARTS] [--overwrite] [--verbose]
```

**Parameters**
|Name|Description|Optional|
|---|---|---|
|`--id`|Specify dataset ID. Default: previously created / accessed dataset|Yes|
|`--copy`|Get a writable copy of the dataset in the specified output folder|Yes|
|`--link`|Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset|Yes|
|`--part`|Retrieve a partial copy of the dataset: part number (0 to `--num-parts`-1) out of the total `--num-parts` parts|Yes|
|`--num-parts`|Total number of parts to divide the dataset into. Note that the minimum retrieved part is a single chunk in a dataset (or its parents). Example: dataset gen4, with 3 parents, each with a single chunk, can be divided into 4 parts|Yes|
|`--overwrite`|If `True`, overwrite the target folder|Yes|
|`--verbose`|Verbose report all file changes (instead of summary)|Yes|
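For instance, to fetch a writable copy into a hypothetical local folder:

```bash
# Download the current dataset into a mutable local folder
clearml-data get --copy ~/datasets/sample
```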
## publish

Publish the dataset for public use. The dataset must be [finalized](#close) before it is published.

```bash
clearml-data publish [-h] --id ID
```

**Parameters**
|Name|Description|Optional|
|---|---|---|
|`--id`|The dataset task ID to be published|No|
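For example (the dataset ID is a placeholder):

```bash
# Publish a finalized dataset
clearml-data publish --id <dataset_id>
```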