| title |
|---|
| CLI |
:::important
This page covers clearml-data, ClearML's file-based data management solution.
See Hyper-Datasets for ClearML's advanced queryable dataset management solution.
:::
The clearml-data utility is a CLI tool for controlling and managing your data with ClearML.
This page provides a reference for clearml-data's CLI commands.
Creating a Dataset
clearml-data create --project <project_name> --name <dataset_name> --parents <existing_dataset_id>
Creates a new dataset.
:::tip Locating Dataset ID
To locate a dataset's ID, go to the dataset task's info panel in the WebApp. At the top of the panel,
to the right of the dataset task name, click ID and the dataset ID appears.
:::
Parameters
:::info
clearml-data works in a stateful mode, so once a new dataset is created, the commands that follow
do not require the --id flag.
:::
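As a rough illustration of the stateful mode, the following sketch chains create, add, and close without passing --id to the later commands. The function name `new_dataset_from_folder` and the `CLEARML_DATA` override variable are hypothetical helpers introduced here (the override only exists so the sequence can be dry-run with `echo`); the flags are the ones listed on this page.

```shell
# Assumed override: set CLEARML_DATA=echo to print the commands instead of running them.
CLEARML_DATA="${CLEARML_DATA:-clearml-data}"

# Hypothetical helper: create a dataset, add a folder to it, and finalize it.
new_dataset_from_folder() {
    project="$1"
    name="$2"
    folder="$3"
    # Create the dataset; it becomes the current dataset for the commands below.
    "$CLEARML_DATA" create --project "$project" --name "$name" || return 1
    # No --id here: stateful mode targets the dataset created above.
    "$CLEARML_DATA" add --files "$folder" || return 1
    # Finalizing also uploads any files that were not uploaded yet.
    "$CLEARML_DATA" close
}
```

A call might look like `new_dataset_from_folder my_project my_dataset ./data`, where the project name, dataset name, and folder are placeholders.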
Adding Files
clearml-data add --id <dataset_id> --files <filenames/folders_to_add>
It's possible to add individual files or complete folders.
Parameters
Removing Files
clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>
Parameters
Uploading Dataset Content
clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]
Uploads added files to ClearML Server by default. It's possible to specify a different storage
medium by entering an upload destination, such as s3://bucket, gs://, azure://, /mnt/shared/.
Parameters
| Name | Description | Optional |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | Yes |
| --storage | Remote storage to use for the dataset files. Default: files_server | Yes |
| --verbose | Verbose reporting | Yes |
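The choice between the default file server and an explicit destination can be sketched as a small wrapper. This is a minimal example under stated assumptions: `upload_dataset` and the `CLEARML_DATA` override are hypothetical names, and the wrapper relies on stateful mode, so it omits --id.

```shell
# Assumed override: set CLEARML_DATA=echo to print the command instead of running it.
CLEARML_DATA="${CLEARML_DATA:-clearml-data}"

# Hypothetical helper: upload the current dataset, optionally to a given destination.
upload_dataset() {
    dest="${1:-}"
    if [ -n "$dest" ]; then
        # Explicit destination, e.g. s3://bucket or /mnt/shared/
        "$CLEARML_DATA" upload --storage "$dest" --verbose
    else
        # No destination given: clearml-data falls back to its default storage.
        "$CLEARML_DATA" upload --verbose
    fi
}
```

For example, `upload_dataset s3://bucket` targets the bucket, while `upload_dataset` with no argument keeps the default.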
Finalizing a Dataset
clearml-data close --id <dataset_id>
Finalizes the dataset and makes it ready to be consumed. It automatically uploads all files that were not previously uploaded. Once a dataset is finalized, it can no longer be modified.
Parameters
Syncing Local Storage
clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']
This command syncs a folder's content with ClearML. It is useful when a single source of truth (i.e. a folder) is updated from time to time.
When an update should be reflected in ClearML, call clearml-data sync, create a new dataset, and specify the folder;
the changes (file additions, modifications, and removals) are then reflected in ClearML.
This command also uploads the data and finalizes the dataset automatically.
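A periodic sync of a local folder could be wrapped as follows. This is only a sketch: `sync_folder` and the `CLEARML_DATA` override are hypothetical names, and it assumes the given ID refers to a dataset that has not been finalized yet. The directory check is an added safeguard, not something clearml-data requires.

```shell
# Assumed override: set CLEARML_DATA=echo to print the command instead of running it.
CLEARML_DATA="${CLEARML_DATA:-clearml-data}"

# Hypothetical helper: sync a local folder into the given dataset.
sync_folder() {
    dataset_id="$1"
    folder="$2"
    # Fail early if the source folder is missing.
    if [ ! -d "$folder" ]; then
        echo "sync_folder: '$folder' is not a directory" >&2
        return 1
    fi
    "$CLEARML_DATA" sync --id "$dataset_id" --folder "$folder"
}
```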
Parameters
Listing Dataset Content
clearml-data list [--id <dataset_id>]
Parameters
Deleting a Dataset
clearml-data delete [--id <dataset_id_to_delete>]
Deletes an entire dataset from ClearML. This can also be used to delete a newly created dataset.
This does not work on datasets with children.
Parameters
Searching for a Dataset
clearml-data search [--ids <dataset_ids>] [--name <name>] [--project <project_name>] [--tags <tag>]
Lists all datasets in the system that match the search request.
Datasets can be searched by project, name, ID, and tags.
Parameters
| Name | Description | Optional |
|---|---|---|
| --ids | A list of dataset IDs | Yes |
| --project | The project name of the datasets | Yes |
| --name | A dataset name or a partial name to filter datasets by | Yes |
| --tags | A list of dataset user tags | Yes |
Comparing Two Datasets
clearml-data compare [--source SOURCE] [--target TARGET]
Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:
Comparison summary: 4 files removed, 3 files modified, 0 files added
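When the summary needs to be consumed by a script, the three counts can be pulled out of that line. The sketch below assumes the summary always has exactly the format shown above; `parse_compare_summary` is a hypothetical helper name, not part of clearml-data.

```shell
# Hypothetical helper: print "<removed> <modified> <added>" for a summary line.
# Prints nothing if the line does not match the assumed format.
parse_compare_summary() {
    printf '%s\n' "$1" | sed -n \
        's/^Comparison summary: \([0-9][0-9]*\) files removed, \([0-9][0-9]*\) files modified, \([0-9][0-9]*\) files added$/\1 \2 \3/p'
}
```

Applied to the sample line above, the helper prints `4 3 0`.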
Parameters
Merging Datasets
clearml-data squash --name NAME --ids [IDS [IDS ...]]
Squash (merge) multiple datasets into a single dataset version.
Parameters
Verifying a Dataset
clearml-data verify [--id ID] [--folder FOLDER]
Verify that the dataset content matches the data from the local source.
Parameters
Getting a Dataset
clearml-data get [--id ID] [--copy COPY] [--link LINK] [--overwrite]
Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the --copy flag.
Parameters
Publishing a Dataset
clearml-data publish --id ID
Publish the dataset for public use. The dataset must be finalized before it is published.
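Because publishing requires a finalized dataset, the two steps can be chained so that publish runs only if close succeeds. This is a minimal sketch: `finalize_and_publish` and the `CLEARML_DATA` override are hypothetical names, and it assumes the dataset has not already been finalized.

```shell
# Assumed override: set CLEARML_DATA=echo to print the commands instead of running them.
CLEARML_DATA="${CLEARML_DATA:-clearml-data}"

# Hypothetical helper: finalize a dataset, then publish it.
finalize_and_publish() {
    dataset_id="$1"
    # Publish only runs if finalizing succeeded.
    "$CLEARML_DATA" close --id "$dataset_id" \
        && "$CLEARML_DATA" publish --id "$dataset_id"
}
```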
Parameters