clearml-docs/docs/clearml_data/clearml_data_cli.md
2021-12-26 15:09:03 +02:00

---
title: CLI
---

:::important
This page covers clearml-data, ClearML's file-based data management solution. See Hyper-Datasets for ClearML's advanced queryable dataset management solution.
:::

The clearml-data utility is a CLI tool for controlling and managing your data with ClearML.

This page provides a reference for clearml-data's CLI commands.

Creating a Dataset

clearml-data create --project <project_name> --name <dataset_name> [--parents <existing_dataset_id>]

Creates a new dataset.

:::tip Locating Dataset ID
To locate a dataset's ID, go to the dataset task's info panel in the WebApp. At the top of the panel, to the right of the dataset task name, click ID to display the dataset ID.
:::

Parameters

| Name | Description | Optional |
|---|---|---|
| `--name` | Dataset's name | No |
| `--project` | Dataset's project | No |
| `--parents` | IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered; they are merged in the order in which they were entered | Yes |
| `--tags` | Dataset user tags. Labeling a dataset with tags can be useful for organizing datasets | Yes |

:::info
clearml-data works in a stateful mode, so once a new dataset is created, the commands that follow do not require the --id flag.
:::
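
As a rough sketch of the stateful mode described above, the following script creates a dataset and then runs add, upload, and close without passing --id. The project name, dataset name, and sample folder are hypothetical placeholders. The commands run only if clearml-data is found on the PATH, and a configured ClearML Server is assumed.

```shell
#!/bin/sh
# Hypothetical names, used for illustration only.
PROJECT="Example Project"
DATASET="example-dataset"

# Create a small sample folder so there is something to add.
DATA_DIR="$(mktemp -d)"
printf 'id,label\n1,cat\n2,dog\n' > "$DATA_DIR/labels.csv"

if command -v clearml-data >/dev/null 2>&1; then
    # Create the dataset; the commands below reuse it without --id.
    clearml-data create --project "$PROJECT" --name "$DATASET"
    clearml-data add --files "$DATA_DIR"
    clearml-data upload
    clearml-data close
else
    echo "clearml-data not found on PATH; skipping" >&2
fi
```

Passing --id explicitly to each command should behave the same way and may be preferable in scripts that handle several datasets.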


Adding Files

clearml-data add --id <dataset_id> --files <filenames/folders_to_add>

It's possible to add individual files or complete folders.

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Dataset's ID. Default: previously created / accessed dataset | Yes |
| `--files` | Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | No |
| `--dataset-folder` | Dataset base folder to add the files to in the dataset. Default: dataset root | Yes |
| `--non-recursive` | Disable recursive scan of files | Yes |
| `--verbose` | Verbose reporting | Yes |
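
The following sketch shows one way to combine a wildcard, a folder, and --dataset-folder in a single add call. The local folder layout, the `images` dataset folder, and the `DATASET_ID` environment variable are assumptions made for this example. The command runs only if clearml-data is installed and `DATASET_ID` is set.

```shell
#!/bin/sh
# Hypothetical local folder with two images and a JSON subfolder.
DATA_DIR="$(mktemp -d)"
mkdir -p "$DATA_DIR/json"
: > "$DATA_DIR/a.jpg"
: > "$DATA_DIR/b.jpg"
printf '{}\n' > "$DATA_DIR/json/meta.json"

# DATASET_ID is a hypothetical environment variable holding a real dataset ID.
if command -v clearml-data >/dev/null 2>&1 && [ -n "${DATASET_ID:-}" ]; then
    # The shell expands *.jpg here; the whole json folder is added as well.
    # Files are placed under the "images" folder inside the dataset.
    clearml-data add --id "$DATASET_ID" \
        --files "$DATA_DIR"/*.jpg "$DATA_DIR/json" \
        --dataset-folder images \
        --verbose
else
    echo "Set DATASET_ID and install clearml-data to run this example" >&2
fi
```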

Removing Files

clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Dataset's ID. Default: previously created / accessed dataset | Yes |
| `--files` | Files / folders to remove. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`. Note: the file path is the path within the dataset, not the local path | No |
| `--non-recursive` | Disable recursive scan of files | Yes |
| `--verbose` | Verbose reporting | Yes |

Uploading Dataset Content

clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]

Uploads the added files. By default, files are uploaded to the ClearML Server. To use a different storage medium, specify an upload destination with --storage, such as s3://bucket, gs://, azure://, or /mnt/shared/.

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Dataset's ID. Default: previously created / accessed dataset | Yes |
| `--storage` | Remote storage to use for the dataset files. Default: files_server | Yes |
| `--verbose` | Verbose reporting | Yes |

Finalizing a Dataset

clearml-data close --id <dataset_id>

Finalizes the dataset and makes it ready to be consumed. It automatically uploads all files that were not previously uploaded. Once a dataset is finalized, it can no longer be modified.

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Dataset's ID. Default: previously created / accessed dataset | Yes |
| `--storage` | Remote storage to use for the dataset files. Default: files_server | Yes |
| `--disable-upload` | Disable automatic upload when closing the dataset | Yes |
| `--verbose` | Verbose reporting | Yes |

Syncing Local Storage

clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']

This command syncs a folder's content with ClearML. It is useful when a user has a single point of truth (i.e. a folder) that is updated from time to time.

When an update should be reflected in ClearML, call clearml-data sync to create a new dataset and specify the folder. The changes (file additions, modifications, and removals) are then reflected in ClearML.

This command also uploads the data and finalizes the dataset automatically.

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Dataset's ID. Default: previously created / accessed dataset | Yes |
| `--folder` | Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | No |
| `--storage` | Remote storage to use for the dataset files. Default: files_server | Yes |
| `--parents` | IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset | Yes |
| `--project` | If creating a new dataset, specify the dataset's project name | Yes |
| `--name` | If creating a new dataset, specify the dataset's name | Yes |
| `--tags` | Dataset user tags | Yes |
| `--skip-close` | Do not auto close dataset after syncing folders | Yes |
| `--verbose` | Verbose reporting | Yes |
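
A minimal sketch of the sync flow described above: a first sync creates a dataset from a folder, the folder then changes, and a later sync records the changes in a new dataset that names the first one as its parent. The project and dataset names, the folder contents, and the `PARENT_ID` environment variable are hypothetical. The commands run only if clearml-data is installed.

```shell
#!/bin/sh
# Hypothetical folder acting as the single point of truth.
SYNC_DIR="$(mktemp -d)"
printf 'v1\n' > "$SYNC_DIR/data.txt"

if command -v clearml-data >/dev/null 2>&1; then
    # First sync: creates a new dataset from the folder.
    clearml-data sync --project "Example Project" --name "synced-dataset" \
        --folder "$SYNC_DIR"
fi

# Simulate a later update to the folder.
printf 'v2\n' > "$SYNC_DIR/data.txt"
printf 'new\n' > "$SYNC_DIR/extra.txt"

# PARENT_ID is a hypothetical variable holding the first dataset's ID.
if command -v clearml-data >/dev/null 2>&1 && [ -n "${PARENT_ID:-}" ]; then
    # Later sync: a new dataset version based on the parent.
    clearml-data sync --project "Example Project" --name "synced-dataset" \
        --parents "$PARENT_ID" --folder "$SYNC_DIR"
fi
```

Adding --skip-close to either call leaves the dataset open, which may be useful if more files need to be added before finalizing.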

Listing Dataset Content

clearml-data list [--id <dataset_id>]

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset | Yes |
| `--project` | Specify dataset project name (if used instead of ID, dataset name is also required) | Yes |
| `--name` | Specify dataset name (if used instead of ID, dataset project is also required) | Yes |
| `--filter` | Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder` | Yes |
| `--modified` | Only list file changes (add / remove / modify) introduced in this version | Yes |
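
The filter option can be sketched as follows, reusing the example patterns from the table. `DATASET_ID` is a hypothetical environment variable, and the command runs only if clearml-data is installed and the variable is set.

```shell
#!/bin/sh
# DATASET_ID is a hypothetical environment variable holding a real dataset ID.
listing="(not run)"
if command -v clearml-data >/dev/null 2>&1 && [ -n "${DATASET_ID:-}" ]; then
    # Quote the wildcard so the shell does not expand it against local files;
    # the patterns refer to paths inside the dataset.
    listing="$(clearml-data list --id "$DATASET_ID" \
        --filter 'folder/date_*.json' folder/sub-folder)"
fi
printf '%s\n' "$listing"
```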

Deleting a Dataset

clearml-data delete [--id <dataset_id_to_delete>]

Deletes an entire dataset from ClearML. This can also be used to delete a newly created dataset.

This does not work on datasets with children.

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet | Yes |
| `--force` | Force dataset deletion even if other dataset versions depend on it | Yes |

Searching for a Dataset

clearml-data search [--ids <dataset_id>] [--name <name>] [--project <project_name>] [--tags <tag>]

Lists all datasets in the system that match the search request.

Datasets can be searched by project, name, ID, and tags.

Parameters

| Name | Description | Optional |
|---|---|---|
| `--ids` | A list of dataset IDs | Yes |
| `--project` | The project name of the datasets | Yes |
| `--name` | A dataset name or a partial name to filter datasets by | Yes |
| `--tags` | A list of dataset user tags | Yes |

Comparing Two Datasets

clearml-data compare --source SOURCE --target TARGET

Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:

Comparison summary: 4 files removed, 3 files modified, 0 files added

Parameters

| Name | Description | Optional |
|---|---|---|
| `--source` | Source dataset ID (used as the baseline) | No |
| `--target` | Target dataset ID (compared against the source baseline dataset) | No |
| `--verbose` | Verbose report of all file changes (instead of a summary) | Yes |
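
As an illustration, the comparison output can be captured in a script for logging or review. `SOURCE_ID` and `TARGET_ID` are hypothetical environment variables holding the two dataset IDs. The command runs only if clearml-data is installed and both variables are set.

```shell
#!/bin/sh
# SOURCE_ID and TARGET_ID are hypothetical variables holding dataset IDs.
summary="(not run)"
if command -v clearml-data >/dev/null 2>&1 \
    && [ -n "${SOURCE_ID:-}" ] && [ -n "${TARGET_ID:-}" ]; then
    # Capture the comparison output of target vs. source.
    summary="$(clearml-data compare --source "$SOURCE_ID" --target "$TARGET_ID")"
fi
printf '%s\n' "$summary"
```

Adding --verbose reports every changed file instead of the summary alone.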

Merging Datasets

clearml-data squash --name NAME --ids [IDS [IDS ...]] 

Squash (merge) multiple datasets into a single dataset version.

Parameters

| Name | Description | Optional |
|---|---|---|
| `--name` | Name of the squashed dataset to create | No |
| `--ids` | Source dataset IDs to squash (merge down) | No |
| `--storage` | Remote storage to use for the dataset files. Default: files_server | Yes |
| `--verbose` | Verbose report of all file changes (instead of a summary) | Yes |

Verifying a Dataset

clearml-data verify [--id ID] [--folder FOLDER] 

Verify that the dataset content matches the data from the local source.

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Specify dataset ID. Default: previously created / accessed dataset | Yes |
| `--folder` | Specify dataset local copy (if not provided, the local cache folder will be verified) | Yes |
| `--filesize` | If True, only verify file size and skip hash checks (default: false) | Yes |
| `--verbose` | Verbose report of all file changes (instead of a summary) | Yes |

Getting a Dataset

clearml-data get [--id ID] [--copy COPY] [--link LINK] [--overwrite]

Get a local copy of a dataset. By default, you get a read-only cached folder. To get a mutable copy, use the --copy flag.

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | Specify dataset ID. Default: previously created / accessed dataset | Yes |
| `--copy` | Get a writable copy of the dataset to a specific output folder | Yes |
| `--link` | Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset | Yes |
| `--overwrite` | If True, overwrite the target folder | Yes |
| `--verbose` | Verbose report of all file changes (instead of a summary) | Yes |
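
A short sketch of fetching a writable copy follows. The output folder and the `DATASET_ID` environment variable are hypothetical. The command runs only if clearml-data is installed and the variable is set.

```shell
#!/bin/sh
# Hypothetical output folder for the writable copy.
OUT_DIR="$(mktemp -d)/dataset_copy"

# DATASET_ID is a hypothetical environment variable holding a real dataset ID.
if command -v clearml-data >/dev/null 2>&1 && [ -n "${DATASET_ID:-}" ]; then
    # Copy the dataset into OUT_DIR, replacing the folder if it already exists.
    clearml-data get --id "$DATASET_ID" --copy "$OUT_DIR" --overwrite
    ls "$OUT_DIR"
else
    echo "Set DATASET_ID and install clearml-data to run this example" >&2
fi
```

Omitting --copy returns the read-only cached folder instead, which avoids duplicating the data when no local changes are needed.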

Publishing a Dataset

clearml-data publish --id ID

Publish the dataset for public use. The dataset must be finalized before it is published.

Parameters

| Name | Description | Optional |
|---|---|---|
| `--id` | ID of the dataset task to be published | No |