---
title: CLI
---

:::important
This page covers `clearml-data`, ClearML's file-based data management solution. See Hyper-Datasets for ClearML's advanced queryable dataset management solution.
:::

The `clearml-data` utility is a CLI tool for controlling and managing your data with ClearML.

This page is a reference for the `clearml-data` CLI commands.

create

Creates a new dataset.

clearml-data create [-h] [--parents [PARENTS [PARENTS ...]]] [--project PROJECT] 
                    --name NAME [--tags [TAGS [TAGS ...]]]

Parameters

| Name | Description | Optional |
|---|---|---|
| `--name` | Dataset's name | No |
| `--project` | Dataset's project | Yes |
| `--parents` | IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be listed; they are merged in the order in which they are entered | Yes |
| `--tags` | Dataset user tags. Tags label the dataset, which can be useful for organizing datasets | Yes |

:::tip Dataset ID
To locate a dataset's ID, go to the dataset task's info panel in the WebApp. At the top of the panel, to the right of the dataset task name, click ID to display the dataset ID.

`clearml-data` works in a stateful mode, so once a new dataset is created, the following commands do not require the `--id` flag.
:::
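As a minimal sketch of the stateful mode described in the tip above, the following script creates a dataset, adds a folder, and finalizes it without passing `--id` after the first command. The project name `examples`, the dataset name `example-dataset`, and the path `./data` are hypothetical. The `run` helper is an illustrative dry-run wrapper, not part of `clearml-data`: it only prints each command unless `CLEARML_RUN=1` is set.

```shell
#!/bin/sh
# Dry-run helper (illustrative, not part of clearml-data):
# prints the command, or executes it when CLEARML_RUN=1 is set.
run() {
  if [ "${CLEARML_RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

# Hypothetical project, dataset name, and local folder.
run clearml-data create --project examples --name example-dataset
# No --id below: the dataset created above is the current one.
run clearml-data add --files ./data
# Finalizing uploads any files that were not uploaded yet.
run clearml-data close
```

Running the script as-is only echoes the three commands, which makes it easy to review the sequence before executing it against a real server.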

add

Add individual files or complete folders to the dataset.

clearml-data add [-h] [--id ID] [--dataset-folder DATASET_FOLDER]
                 [--files [FILES [FILES ...]]] [--links [LINKS [LINKS ...]]] 
                 [--non-recursive] [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Dataset's ID. Default: previously created / accessed dataset | Yes |
| `--files` | Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`. Items will be uploaded to the dataset's designated storage. | Yes |
| `--links` | Links to files / folders to add. Supports S3, GS, and Azure links. Example: `s3://bucket/data` `azure://bucket/folder`. Items remain in their original location. | Yes |
| `--dataset-folder` | Base folder within the dataset to add the files to. Default: dataset root | Yes |
| `--non-recursive` | Disable recursive scan of files | Yes |
| `--verbose` | Verbose reporting | Yes |
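The difference between `--files` (items are uploaded) and `--links` (items stay where they are) can be sketched with a small helper that picks the flag from the item's prefix. The helper name `flag_for` and the sample paths are hypothetical, and the prefix list covers only the three link schemes named in the table above.

```shell
#!/bin/sh
# Choose the clearml-data add flag for an item (illustrative helper).
# Remote URLs with the schemes listed in the table are treated as links;
# everything else is treated as a local file or folder.
flag_for() {
  case "$1" in
    s3://*|gs://*|azure://*) echo "--links" ;;
    *) echo "--files" ;;
  esac
}

# Print the command that would be used for each item.
for item in ./data/images s3://bucket/data; do
  echo "clearml-data add $(flag_for "$item") $item"
done
```

The loop only prints the commands; remove the `echo` wrapper to run them against the current dataset.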

remove

Remove files/links from the dataset.

clearml-data remove [-h] [--id ID] [--files [FILES [FILES ...]]] 
                    [--non-recursive] [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Dataset's ID. Default: previously created / accessed dataset | Yes |
| `--files` | Files / folders to remove. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`. Note: the file path is the path within the dataset, not the local path. For links, you can specify their URL (e.g. `s3://bucket/data`) | Yes |
| `--non-recursive` | Disable recursive scan of files | Yes |
| `--verbose` | Verbose reporting | Yes |

upload

Upload the local dataset changes to the server. By default, the data is uploaded to the ClearML Server. To use a different storage medium, specify an upload destination with `--storage`, such as `s3://bucket`, `gs://`, `azure://`, or `/mnt/shared/`.

clearml-data upload [-h] [--id ID] [--storage STORAGE] [--chunk-size CHUNK_SIZE] 
                    [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Dataset's ID. Default: previously created / accessed dataset | Yes |
| `--storage` | Remote storage to use for the dataset files. Default: files_server | Yes |
| `--chunk-size` | Dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). Example: with 512, the dataset is split and uploaded in 512 MB chunks. | Yes |
| `--verbose` | Verbose reporting | Yes |
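To get a feel for what `--chunk-size` implies, the arithmetic can be sketched as a ceiling division. This is only a rough estimate under the assumption that chunks fill evenly; the actual number of chunks may differ, for example because of file boundaries or compression. The helper name `estimate_chunks` and the 1300 MB dataset size are hypothetical.

```shell
#!/bin/sh
# Rough estimate of the number of upload chunks (illustrative only).
# $1: total dataset size in MB
# $2: chunk size in MB (defaults to 512; -1 means a single chunk)
estimate_chunks() {
  total_mb=$1
  chunk_mb=${2:-512}
  if [ "$chunk_mb" -eq -1 ]; then
    echo 1
  else
    # Ceiling division: round up so a partial chunk still counts.
    echo $(( (total_mb + chunk_mb - 1) / chunk_mb ))
  fi
}

estimate_chunks 1300 512   # prints 3
estimate_chunks 1300 -1    # prints 1
```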

close

Finalizes the dataset and makes it ready to be consumed. This automatically uploads all files that were not previously uploaded. Once a dataset is finalized, it can no longer be modified.

clearml-data close [-h] [--id ID] [--storage STORAGE] [--disable-upload]
                   [--chunk-size CHUNK_SIZE] [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Dataset's ID. Default: previously created / accessed dataset | Yes |
| `--storage` | Remote storage to use for the dataset files. Default: files_server | Yes |
| `--disable-upload` | Disable automatic upload when closing the dataset | Yes |
| `--chunk-size` | Dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). Example: with 512, the dataset is split and uploaded in 512 MB chunks. | Yes |
| `--verbose` | Verbose reporting | Yes |

sync

Sync a folder's content with ClearML. This option is useful when a user has a single source of truth (i.e. a folder) that is updated from time to time.

When an update should be reflected in ClearML's system, call `clearml-data sync` and pass the folder path. The changes (file additions, modifications, and removals) will be reflected in ClearML.

This command also uploads the data and finalizes the dataset automatically.

clearml-data sync [-h] [--id ID] [--dataset-folder DATASET_FOLDER] --folder FOLDER
                  [--parents [PARENTS [PARENTS ...]]] [--project PROJECT] [--name NAME]
                  [--tags [TAGS [TAGS ...]]] [--storage STORAGE] [--skip-close]
                  [--chunk-size CHUNK_SIZE] [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Dataset's ID. Default: previously created / accessed dataset | Yes |
| `--dataset-folder` | Base folder within the dataset to add the files to. Default: dataset root | Yes |
| `--folder` | Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | No |
| `--storage` | Remote storage to use for the dataset files. Default: files_server | Yes |
| `--parents` | IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset | Yes |
| `--project` | If creating a new dataset, specify the dataset's project name | Yes |
| `--name` | If creating a new dataset, specify the dataset's name | Yes |
| `--tags` | Dataset user tags | Yes |
| `--skip-close` | Do not automatically close the dataset after syncing folders | Yes |
| `--chunk-size` | Dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). Example: with 512, the dataset is split and uploaded in 512 MB chunks. | Yes |
| `--verbose` | Verbose reporting | Yes |

list

List a dataset's contents.

clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME]
                  [--filter [FILTER [FILTER ...]]] [--modified]

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset | Yes |
| `--project` | Specify dataset project name (if used instead of ID, dataset name is also required) | Yes |
| `--name` | Specify dataset name (if used instead of ID, dataset project is also required) | Yes |
| `--filter` | Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder` | Yes |
| `--modified` | Only list file changes (add / remove / modify) introduced in this version | Yes |

delete

Delete an entire dataset from ClearML. This can also be used to delete a newly created dataset.

This does not work on datasets with children.

clearml-data delete [-h] [--id ID] [--force]

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet | Yes |
| `--force` | Force dataset deletion even if other dataset versions depend on it | Yes |

search

Search datasets in the system by project, name, ID, and/or tags.

Returns a list of all datasets in the system that match the search request, sorted by creation time.

clearml-data search [-h] [--ids [IDS [IDS ...]]] [--project PROJECT] 
                    [--name NAME] [--tags [TAGS [TAGS ...]]]

Parameters

| Name | Description | Optional |
|---|---|---|
| `--ids` | A list of dataset IDs | Yes |
| `--project` | The project name of the datasets | Yes |
| `--name` | A dataset name or a partial name to filter datasets by | Yes |
| `--tags` | A list of dataset user tags | Yes |

compare

Compare two datasets (target vs. source). The command returns a comparison summary that looks like this: `Comparison summary: 4 files removed, 3 files modified, 0 files added`

clearml-data compare [-h] --source SOURCE --target TARGET [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| `--source` | Source dataset ID (used as baseline) | No |
| `--target` | Target dataset ID (compare against the source baseline dataset) | No |
| `--verbose` | Verbose reporting of all file changes (instead of a summary) | Yes |
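When `compare` is used in a script, the counts in the summary line may need to be extracted. The following is a minimal sketch that parses a line in exactly the format quoted in the description above. It assumes the real output matches that wording character for character, which should be checked against your installed version. The helper name `parse_summary` is hypothetical.

```shell
#!/bin/sh
# Extract the three counts from a comparison summary line (illustrative).
# $1: a line such as
#     "Comparison summary: 4 files removed, 3 files modified, 0 files added"
# Prints "removed=N modified=N added=N", or nothing if the line does not match.
parse_summary() {
  echo "$1" | sed -n 's/^Comparison summary: \([0-9][0-9]*\) files removed, \([0-9][0-9]*\) files modified, \([0-9][0-9]*\) files added$/removed=\1 modified=\2 added=\3/p'
}

parse_summary "Comparison summary: 4 files removed, 3 files modified, 0 files added"
# prints: removed=4 modified=3 added=0
```

Because `sed -n ... p` prints only on a match, an unexpected output format yields an empty string that the calling script can test for.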

squash

Squash multiple datasets into a single dataset version (merge down).

clearml-data squash [-h] --name NAME --ids [IDS [IDS ...]] [--storage STORAGE] [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| `--name` | Name of the squashed dataset to create | No |
| `--ids` | Source dataset IDs to squash (merge down) | No |
| `--storage` | Remote storage to use for the dataset files. Default: files_server | Yes |
| `--verbose` | Verbose reporting of all file changes (instead of a summary) | Yes |

verify

Verify that the dataset content matches the data from the local source.

clearml-data verify [-h] [--id ID] [--folder FOLDER] [--filesize] [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Specify dataset ID. Default: previously created / accessed dataset | Yes |
| `--folder` | Specify dataset local copy (if not provided, the local cache folder will be verified) | Yes |
| `--filesize` | If `True`, only verify file size and skip hash checks (default: `False`) | Yes |
| `--verbose` | Verbose reporting of all file changes (instead of a summary) | Yes |

get

Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the `--copy` flag.

clearml-data get [-h] [--id ID] [--copy COPY] [--link LINK] [--part PART]
                 [--num-parts NUM_PARTS] [--overwrite] [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Specify dataset ID. Default: previously created / accessed dataset | Yes |
| `--copy` | Get a writable copy of the dataset to a specific output folder | Yes |
| `--link` | Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset | Yes |
| `--part` | Retrieve a partial copy of the dataset. Part number (0 to `--num-parts`-1) of total parts `--num-parts`. | Yes |
| `--num-parts` | Total number of parts to divide the dataset into. Note: the minimum retrieved part is a single chunk in a dataset (or its parents). Example: Dataset gen4, with 3 parents, each with a single chunk, can be divided into 4 parts | Yes |
| `--overwrite` | If `True`, overwrite the target folder | Yes |
| `--verbose` | Verbose reporting of all file changes (instead of a summary) | Yes |
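The `--part` / `--num-parts` pair can be sketched as a loop that emits one `get` command per part, for example to hand one command to each worker. Part indices run from 0 to `--num-parts`-1, as stated in the table above. The helper name `part_commands` and the dataset ID `abc123` are hypothetical placeholders.

```shell
#!/bin/sh
# Print one "clearml-data get" command per part (illustrative helper).
# $1: dataset ID
# $2: total number of parts
part_commands() {
  part=0
  while [ "$part" -lt "$2" ]; do
    echo "clearml-data get --id $1 --part $part --num-parts $2"
    part=$((part + 1))
  done
}

# Hypothetical dataset ID, split across 4 workers.
part_commands abc123 4
```

Since the smallest retrievable part is a single chunk, asking for more parts than the dataset (and its parents) has chunks is unlikely to spread the data any further.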

publish

Publish the dataset for public use. The dataset must be finalized before it is published.

clearml-data publish [-h] --id ID

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | The dataset task ID to be published. | No |
