2022-06-30 09:27:45 +03:00



:::important
This page covers clearml-data, ClearML's file-based data management solution. See Hyper-Datasets for ClearML's advanced queryable dataset management solution.
:::

The clearml-data utility is a CLI tool for controlling and managing your data with ClearML.

This page provides a reference for clearml-data's CLI commands.


Creates a new dataset.

clearml-data create [-h] [--parents [PARENTS [PARENTS ...]]] [--project PROJECT] 
                    --name NAME [--version VERSION] [--output-uri OUTPUT_URI] 
                    [--tags [TAGS [TAGS ...]]]


|Name|Description|Optional|
|---|---|---|
|`--name`|Dataset's name|No|
|`--project`|Dataset's project|No|
|`--version`|Dataset version. If not specified, a version is assigned automatically|Yes|
|`--parents`|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, and they are merged in the order in which they were entered|Yes|
|`--output-uri`|Sets where the dataset and its previews are uploaded to|Yes|
|`--tags`|Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets|Yes|

:::tip Dataset ID

  • To locate a dataset's ID, go to the dataset task's info panel in the WebApp. At the top of the panel, to the right of the dataset task name, click ID and the dataset ID appears.

  • clearml-data works in a stateful mode, so once a new dataset is created, the following commands do not require the --id flag.

:::


Add individual files or complete folders to the dataset.

clearml-data add [-h] [--id ID] [--dataset-folder DATASET_FOLDER]
                 [--files [FILES [FILES ...]]] [--links [LINKS [LINKS ...]]] 
                 [--non-recursive] [--verbose]


|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset's ID. Default: previously created / accessed dataset|Yes|
|`--files`|Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`. Items will be uploaded to the dataset's designated storage.|Yes|
|`--links`|Links to files / folders to add. Supports s3, gs, azure links. Example: `s3://bucket/data azure://bucket/folder`. Items remain in their original location.|Yes|
|`--dataset-folder`|Dataset base folder to add the files to in the dataset. Default: dataset root|Yes|
|`--non-recursive`|Disable recursive scan of files|Yes|
|`--verbose`|Verbose reporting|Yes|
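
As a rough sketch of the stateful mode described in the tip above, the following creates a dataset and then adds content to it without passing `--id`. The project name, dataset name, tag, and local folder are hypothetical placeholders. The `run` helper is an illustrative wrapper, not part of clearml-data: it only prints each command unless `DRY_RUN=0` is set.

```shell
# Illustrative helper (not part of clearml-data): prints the command,
# and executes it only when DRY_RUN=0 is set in the environment.
run() { if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Hypothetical project, name, and tag; replace with your own.
run clearml-data create --project example_project --name example_dataset --tags raw

# No --id is passed: add targets the dataset that was just created.
# ./data/images is a hypothetical local folder; its files are placed
# under images/ inside the dataset.
run clearml-data add --files ./data/images --dataset-folder images

# Linked items are registered but remain in their original location.
run clearml-data add --links s3://bucket/data
```

Run the snippet with `DRY_RUN=0` to execute the commands against a configured ClearML setup.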


Remove files/links from the dataset.

clearml-data remove [-h] [--id ID] [--files [FILES [FILES ...]]] 
                    [--non-recursive] [--verbose]


|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset's ID. Default: previously created / accessed dataset|Yes|
|`--files`|Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: file path is the path within the dataset, not the local path. For links, you can specify their URL (e.g. `s3://bucket/data`)|No|
|`--non-recursive`|Disable recursive scan of files|Yes|
|`--verbose`|Verbose reporting|Yes|


Upload the local dataset changes to the server. By default, the changes are uploaded to the ClearML Server. To use a different storage medium, specify an upload destination, such as s3://bucket, gs://, azure://, /mnt/shared/.

clearml-data upload [-h] [--id ID] [--storage STORAGE] [--chunk-size CHUNK_SIZE] [--verbose]


|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset's ID. Default: previously created / accessed dataset|Yes|
|`--storage`|Remote storage to use for the dataset files. Default: files_server|Yes|
|`--chunk-size`|Set dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). Example: with 512, the dataset is split and uploaded in 512 MB chunks.|Yes|
|`--verbose`|Verbose reporting|Yes|


Finalizes the dataset and makes it ready to be consumed. This automatically uploads all files that were not previously uploaded. Once a dataset is finalized, it can no longer be modified.

clearml-data close [-h] [--id ID] [--storage STORAGE] [--disable-upload]
                   [--chunk-size CHUNK_SIZE] [--verbose]


|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset's ID. Default: previously created / accessed dataset|Yes|
|`--storage`|Remote storage to use for the dataset files. Default: files_server|Yes|
|`--disable-upload`|Disable automatic upload when closing the dataset|Yes|
|`--chunk-size`|Set dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). Example: with 512, the dataset is split and uploaded in 512 MB chunks.|Yes|
|`--verbose`|Verbose reporting|Yes|
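
A minimal sketch of uploading to a non-default destination and then finalizing, assuming a dataset was already created in the current session so `--id` can be omitted. The bucket path and the 128 MB chunk size are hypothetical values, and `run` is an illustrative wrapper (not part of clearml-data) that only prints commands unless `DRY_RUN=0` is set.

```shell
# Illustrative helper (not part of clearml-data): prints the command,
# and executes it only when DRY_RUN=0 is set in the environment.
run() { if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Upload pending changes to a hypothetical bucket path in 128 MB chunks.
run clearml-data upload --storage s3://bucket/datasets --chunk-size 128

# Finalize the dataset. Any files not uploaded yet are uploaded now,
# so the same storage destination is passed again.
run clearml-data close --storage s3://bucket/datasets
```

After `close` succeeds the dataset can no longer be modified, so further changes need a new dataset version.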


Sync a folder's content with ClearML. This command is useful when you have a single source of truth (i.e. a folder) that is updated from time to time.

When an update should be reflected in ClearML's system, call clearml-data sync and pass the folder path. The changes (file additions, modifications, and removals) will be reflected in ClearML.

This command also uploads the data and finalizes the dataset automatically.

clearml-data sync [-h] [--id ID] [--dataset-folder DATASET_FOLDER] --folder FOLDER
                  [--parents [PARENTS [PARENTS ...]]] [--project PROJECT] [--name NAME]
                  [--version VERSION] [--output-uri OUTPUT_URI] [--tags [TAGS [TAGS ...]]]
                  [--storage STORAGE] [--skip-close] [--chunk-size CHUNK_SIZE] [--verbose]


|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset's ID. Default: previously created / accessed dataset|Yes|
|`--dataset-folder`|Dataset base folder to add the files to. Default: dataset root|Yes|
|`--folder`|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|No|
|`--storage`|Remote storage to use for the dataset files. Default: files_server|Yes|
|`--parents`|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|Yes|
|`--project`|If creating a new dataset, specify the dataset's project name|Yes|
|`--name`|If creating a new dataset, specify the dataset's name|Yes|
|`--version`|Specify the dataset's version. Default: 1.0.0|Yes|
|`--tags`|Dataset user tags|Yes|
|`--skip-close`|Do not auto close dataset after syncing folders|Yes|
|`--chunk-size`|Set dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). Example: with 512, the dataset is split and uploaded in 512 MB chunks.|Yes|
|`--verbose`|Verbose reporting|Yes|
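
The sync flow might look roughly like the following: a first sync that creates a new dataset from a folder, and a later sync that records the folder's changes on top of the earlier dataset by passing it as a parent. The project, name, folder, and parent ID are hypothetical placeholders, and `run` is an illustrative wrapper (not part of clearml-data) that only prints commands unless `DRY_RUN=0` is set.

```shell
# Illustrative helper (not part of clearml-data): prints the command,
# and executes it only when DRY_RUN=0 is set in the environment.
run() { if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else printf '%s\n' "$*"; fi; }

# First sync: create a new dataset from a hypothetical local folder.
# By default the data is uploaded and the dataset is finalized.
run clearml-data sync --project example_project --name example_dataset --folder ./data

# Later sync: PARENT_ID is a hypothetical ID of the previously synced dataset.
# Only the modifications made to the folder since then are recorded.
PARENT_ID="aaaabbbbccccdddd"
run clearml-data sync --project example_project --name example_dataset \
    --folder ./data --parents "$PARENT_ID"
```

Adding `--skip-close` to either call leaves the dataset open for further changes instead of finalizing it.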


List a dataset's contents.

clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME] [--version VERSION]
                  [--filter [FILTER [FILTER ...]]] [--modified]


|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|Yes|
|`--project`|Specify dataset project name (if used instead of ID, dataset name is also required)|Yes|
|`--name`|Specify dataset name (if used instead of ID, dataset project is also required)|Yes|
|`--version`|Specify dataset version. Default: most recent version|Yes|
|`--filter`|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|Yes|
|`--modified`|Only list file changes (add / remove / modify) introduced in this version|Yes|
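
A short sketch of the two ways to select a dataset for listing: by project / name combination with filters, or by ID. The project, name, and ID are hypothetical placeholders, and `run` is an illustrative wrapper (not part of clearml-data) that only prints commands unless `DRY_RUN=0` is set.

```shell
# Illustrative helper (not part of clearml-data): prints the command,
# and executes it only when DRY_RUN=0 is set in the environment.
run() { if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Select by project and name, and show only entries matching the filters.
# The wildcard is quoted so the local shell does not expand it.
run clearml-data list --project example_project --name example_dataset \
    --filter 'folder/date_*.json' folder/sub-folder

# Select by a hypothetical dataset ID and show only what this version changed.
DATASET_ID="aaaabbbbccccdddd"
run clearml-data list --id "$DATASET_ID" --modified
```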


Sets the description of an existing dataset.

clearml-data set-description [-h] [--id ID] [--description DESCRIPTION]


|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset's ID|No|
|`--description`|Description to be set|No|


Deletes dataset(s). Pass any of the attributes of the dataset(s) you want to delete. Multiple datasets matching the request will raise an exception, unless you pass --entire-dataset and --force. In this case, all matching datasets will be deleted.

If a dataset is a parent to a dataset(s), you must pass --force in order to delete it.

:::warning
Deleting a parent dataset may cause child datasets to lose data!
:::

clearml-data delete [-h] [--id ID] [--project PROJECT] [--name NAME] 
                    [--version VERSION] [--force] [--entire-dataset]


|Name|Description|Optional|
|---|---|---|
|`--id`|ID of the dataset to delete (alternatively, use project / name combination)|Yes|
|`--project`|Specify dataset project name (if used instead of ID, dataset name is also required)|Yes|
|`--name`|Specify dataset name (if used instead of ID, dataset project is also required)|Yes|
|`--version`|Specify dataset version|Yes|
|`--force`|Force dataset deletion even if other dataset versions depend on it. Must also be used if the `--entire-dataset` flag is used|Yes|
|`--entire-dataset`|Delete all found datasets|Yes|
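
To make the difference between the two deletion modes concrete, here is a sketch that first deletes a single dataset by ID and then deletes every dataset matching a project / name combination. The ID, project, and name are hypothetical placeholders, and `run` is an illustrative wrapper (not part of clearml-data) that only prints commands unless `DRY_RUN=0` is set.

```shell
# Illustrative helper (not part of clearml-data): prints the command,
# and executes it only when DRY_RUN=0 is set in the environment.
run() { if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Delete one dataset, identified by a hypothetical ID.
DATASET_ID="aaaabbbbccccdddd"
run clearml-data delete --id "$DATASET_ID"

# Delete all datasets matching the project / name combination.
# Without --entire-dataset and --force, multiple matches raise an exception.
run clearml-data delete --project example_project --name example_dataset \
    --entire-dataset --force
```

Keeping the dry-run default here is deliberate: it lets you review exactly what would be deleted before setting `DRY_RUN=0`.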


Rename a dataset (and all of its versions).

clearml-data rename [-h] --new-name NEW_NAME --project PROJECT --name NAME


|Name|Description|Optional|
|---|---|---|
|`--new-name`|The new name of the dataset|No|
|`--project`|The project the dataset to be renamed belongs to|No|
|`--name`|The current name of the dataset(s) to be renamed|No|


Moves a dataset to another project.

clearml-data move [-h] --new-project NEW_PROJECT --project PROJECT --name NAME


|Name|Description|Optional|
|---|---|---|
|`--new-project`|The new project of the dataset|No|
|`--project`|The current project the dataset to be moved belongs to|No|
|`--name`|The name of the dataset to be moved|No|

Search datasets in the system by project, name, ID, and/or tags.

Returns a list of all datasets in the system that match the search request, sorted by creation time.

clearml-data search [-h] [--ids [IDS [IDS ...]]] [--project PROJECT] 
                    [--name NAME] [--tags [TAGS [TAGS ...]]]


|Name|Description|Optional|
|---|---|---|
|`--ids`|A list of dataset IDs|Yes|
|`--project`|The project name of the datasets|Yes|
|`--name`|A dataset name or a partial name to filter datasets by|Yes|
|`--tags`|A list of dataset user tags|Yes|


Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:

Comparison summary: 4 files removed, 3 files modified, 0 files added

clearml-data compare [-h] --source SOURCE --target TARGET [--verbose]


|Name|Description|Optional|
|---|---|---|
|`--source`|Source dataset ID (used as baseline)|No|
|`--target`|Target dataset ID (compare against the source baseline dataset)|No|
|`--verbose`|Verbose report all file changes (instead of summary)|Yes|


Squash multiple datasets into a single dataset version (merge down).

clearml-data squash [-h] --name NAME --ids [IDS [IDS ...]] [--storage STORAGE] [--verbose]


|Name|Description|Optional|
|---|---|---|
|`--name`|Create squashed dataset name|No|
|`--ids`|Source dataset IDs to squash (merge down)|No|
|`--storage`|Remote storage to use for the dataset files. Default: files_server|Yes|
|`--verbose`|Verbose report all file changes (instead of summary)|Yes|


Verify that the dataset content matches the data from the local source.

clearml-data verify [-h] [--id ID] [--folder FOLDER] [--filesize] [--verbose]


|Name|Description|Optional|
|---|---|---|
|`--id`|Specify dataset ID. Default: previously created / accessed dataset|Yes|
|`--folder`|Specify dataset local copy (if not provided the local cache folder will be verified)|Yes|
|`--filesize`|If True, only verify file size and skip hash checks (default: False)|Yes|
|`--verbose`|Verbose report all file changes (instead of summary)|Yes|


Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the --copy flag.

clearml-data get [-h] [--id ID] [--copy COPY] [--link LINK] [--part PART]
                 [--num-parts NUM_PARTS] [--overwrite] [--verbose]


|Name|Description|Optional|
|---|---|---|
|`--id`|Specify dataset ID. Default: previously created / accessed dataset|Yes|
|`--copy`|Get a writable copy of the dataset to a specific output folder|Yes|
|`--link`|Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset|Yes|
|`--part`|Retrieve a partial copy of the dataset. The part number ranges from 0 to `--num-parts` minus 1|Yes|
|`--num-parts`|Total number of parts to divide the dataset into. Note that the minimum retrieved part is a single chunk in a dataset (or its parents). Example: Dataset gen4, with 3 parents, each with a single chunk, can be divided into 4 parts|Yes|
|`--overwrite`|If True, overwrite the target folder|Yes|
|`--verbose`|Verbose report all file changes (instead of summary)|Yes|
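
As an illustration of the `--part` / `--num-parts` pair, the sketch below fetches one quarter of a dataset, as a single worker out of four might, and then fetches a writable copy of the whole dataset. The dataset ID, worker index, and output folder are hypothetical placeholders, and `run` is an illustrative wrapper (not part of clearml-data) that only prints commands unless `DRY_RUN=0` is set.

```shell
# Illustrative helper (not part of clearml-data): prints the command,
# and executes it only when DRY_RUN=0 is set in the environment.
run() { if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else printf '%s\n' "$*"; fi; }

DATASET_ID="aaaabbbbccccdddd"  # hypothetical dataset ID
NUM_PARTS=4                    # total number of parts
WORKER_INDEX=0                 # this worker's part: 0 .. NUM_PARTS-1

# Read-only cached copy of a single part of the dataset.
run clearml-data get --id "$DATASET_ID" --num-parts "$NUM_PARTS" --part "$WORKER_INDEX"

# Writable copy of the full dataset in a hypothetical output folder,
# overwriting the folder if it already exists.
run clearml-data get --id "$DATASET_ID" --copy ./dataset_copy --overwrite
```

Since the smallest retrievable part is a single chunk, a dataset with fewer chunks than `NUM_PARTS` cannot be split that finely.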


Publish the dataset for public use. The dataset must be finalized before it is published.

clearml-data publish [-h] --id ID


|Name|Description|Optional|
|---|---|---|
|`--id`|The dataset task ID to be published|No|