ClearML Data
In Machine Learning, you are very likely dealing with a gargantuan amount of data that you need to put in a dataset, which you then need to be able to share, reproduce, and track.
ClearML Data Management solves two important challenges:
- Accessibility - Making data easily accessible from every machine.
- Versioning - Linking data and experiments for better traceability.
We believe data is not code. It should not be stored in a git tree, because progress on datasets is not always linear. Moreover, it can be difficult and inefficient to find the commit associated with a certain version of a dataset in a git tree.
A clearml-data dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / network storage).
Datasets can be set up to inherit from other datasets, so data lineages can be created,
and users can track when and how their data changes.
Dataset changes are stored using differentiable storage, meaning a version stores only the change-set from its parent datasets.
Local copies of datasets are always cached, so the same data never needs to be downloaded twice. When a dataset is pulled, it will automatically pull all parent datasets and merge them into one output folder for you to work with.
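The caching behavior can be illustrated with a minimal sketch. This is a conceptual model, not clearml's actual implementation: a local cache keyed by dataset ID, so the same dataset is only ever fetched once.

```python
# Conceptual sketch (not clearml internals): a local cache keyed by
# dataset ID, so repeated pulls of the same dataset hit the cache.
from pathlib import Path
import tempfile

class DatasetCache:
    def __init__(self, cache_root: str):
        self.cache_root = Path(cache_root)
        self.downloads = 0  # count of real fetches, for illustration

    def get_local_copy(self, dataset_id: str) -> Path:
        target = self.cache_root / dataset_id
        if not target.exists():      # only fetch on a cache miss
            target.mkdir(parents=True)
            self.downloads += 1      # a real client would download here
        return target

cache = DatasetCache(tempfile.mkdtemp())
first = cache.get_local_copy("abc123")
second = cache.get_local_copy("abc123")  # served from cache, no new download
```

The second call returns the same path without triggering another download, which is exactly why passing the same dataset to many experiments on one machine stays cheap.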
ClearML-data offers two interfaces:
- `clearml-data` - A CLI utility for creating, uploading, and managing datasets.
- `clearml.Dataset` - A Python interface for creating, retrieving, managing, and using datasets.
Creating a Dataset
Using the clearml-data CLI, users can create datasets using the following commands:
clearml-data create --project dataset_example --name initial_version
clearml-data add --files data_folder
The commands will do the following:
- Start a Data Processing Task called "initial_version" in the "dataset_example" project
- The CLI will return a unique ID for the dataset
- All the files from the "data_folder" folder will be added to the dataset and uploaded by default to the ClearML server.
:::note
clearml-data is stateful and remembers the last created dataset so there's no need to specify a specific dataset ID unless
we want to work on another dataset.
:::
Using a Dataset
Now in our python code, we can access and use the created dataset from anywhere:
from clearml import Dataset
local_path = Dataset.get(dataset_id='dataset_id_from_previous_command').get_local_copy()
All our files are now under local_path, in the same folder structure. It's that simple!
The next step is to set the dataset_id as a parameter for our code and voilà! We can now train on any dataset we have in the system.
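A hedged sketch of that last step, using plain argparse to pass the dataset ID in as a script parameter (the argument name `--dataset-id` is this example's choice, not a clearml convention; only the `Dataset.get()` call inside `main` is the real clearml API, and it requires a configured clearml setup to run):

```python
# Sketch: make the dataset a parameter of the training script, so the
# same code can train on any dataset in the system.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset-id", required=True,
                        help="ID of the ClearML dataset to train on")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    from clearml import Dataset  # lazy import; needs a configured clearml setup
    local_path = Dataset.get(dataset_id=args.dataset_id).get_local_copy()
    # ... load training data from local_path ...
    return local_path
```

Swapping datasets then becomes a command-line change rather than a code change.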
Setup
clearml-data comes built-in with the clearml Python package! Just check out the getting started guide for more info!
Usage
CLI
It's possible to manage datasets (create / modify / upload / delete) with the clearml-data command line tool.
Creating a Dataset
clearml-data create --project <project_name> --name <dataset_name> --parents <existing_dataset_id>
Creates a new dataset.
:::important
clearml-data works in a stateful mode so once a new dataset is created, the following commands
do not require the --id flag.
:::
Add Files to Dataset
clearml-data add --id <dataset_id> --files <filenames/folders_to_add>
It's possible to add individual files or complete folders.
Remove Files From Dataset
clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>
Finalize Dataset
clearml-data close --id <dataset_id>
Finalizes the dataset and makes it ready to be consumed. It automatically uploads all files that were not previously uploaded. Once a dataset is finalized, it can no longer be modified.
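The "finalized means immutable" rule can be modeled in a few lines. This is a conceptual illustration of the documented behavior, not clearml's actual class:

```python
# Conceptual model of the finalize rule: once a dataset version is
# closed, any further modification is rejected.
class DatasetVersion:
    def __init__(self):
        self.files = set()
        self.finalized = False

    def add_files(self, *paths):
        if self.finalized:
            raise RuntimeError("dataset is finalized and can no longer be modified")
        self.files.update(paths)

    def close(self):
        # a real client uploads any pending files before marking as done
        self.finalized = True

ds = DatasetVersion()
ds.add_files("a.csv", "b.csv")
ds.close()
# ds.add_files("c.csv") would now raise RuntimeError
```

Further changes go into a new child version instead, which is how the lineage described above gets built.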
Upload Dataset Content
clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]
Uploads added files to ClearML Server by default. It's possible to specify a different storage
medium by entering an upload destination, such as s3://bucket, gs://, azure://, /mnt/shared/.
Parameters
| Name | Description | Optional |
|---|---|---|
| id | Dataset's ID. Default: previously created / accessed dataset | |
| storage | Remote storage to use for the dataset files. Default: files_server | |
| verbose | Verbose reporting | |
Sync Local Folder
clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']
This option syncs a folder's content with ClearML. It is useful when a user has a single point of truth (i.e. a folder) that updates from time to time.
When an update should be reflected in ClearML, users can call clearml-data sync on the folder; a new dataset is created, and the folder's changes
(file additions, modifications, and removals) are reflected in ClearML.
This command also uploads the data and finalizes the dataset automatically.
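At its core, a sync boils down to comparing the folder's current state with the previous dataset version and deriving a change-set. A hedged sketch of that comparison, using plain dicts of path-to-content-hash (the function name and representation are this example's, not clearml's):

```python
# Sketch: derive the change-set between the last dataset version and
# the folder's current state, each given as {relative path: content hash}.
def change_set(previous: dict, current: dict):
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    modified = sorted(p for p in set(previous) & set(current)
                      if previous[p] != current[p])
    return added, removed, modified

prev = {"a.csv": "h1", "b.csv": "h2", "c.csv": "h3"}
curr = {"a.csv": "h1", "b.csv": "h9", "d.csv": "h4"}
# change_set(prev, curr) -> added d.csv, removed c.csv, modified b.csv
```

Only the change-set needs to be stored and uploaded, which is what keeps repeated syncs of a mostly-unchanged folder cheap.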
List Dataset Content
clearml-data list [--id <dataset_id>]
Delete a Dataset
clearml-data delete [--id <dataset_id_to_delete>]
Deletes an entire dataset from ClearML. This can also be used to delete a newly created dataset.
This does not work on datasets with children.
Search for a Dataset
clearml-data search [--name <name>] [--project <project_name>] [--tags <tag>]
Lists all datasets in the system that match the search request.
Datasets can be searched by project, name, ID, and tags.
Parameters
| Name | Description | Optional |
|---|---|---|
| ids | A list of dataset IDs | |
| project | The project name of the datasets | |
| name | A dataset name or a partial name to filter datasets by | |
| tags | A list of dataset user tags | |
Python API
All API commands should be imported with:
from clearml import Dataset
Dataset.get(dataset_id=DS_ID).get_local_copy()
Returns a path to the dataset in cache, downloading it first if it is not already cached.
Parameters
| Name | Description | Optional |
|---|---|---|
| use_soft_links | If True, use soft links. Default: False on Windows, True on Posix systems | |
| raise_on_error | If True, raise an exception if dataset merging failed on any file | |
Dataset.get(dataset_id=DS_ID).get_mutable_local_copy()
Downloads the dataset to a specific folder (non-cached). If the folder already has contents, specify whether to overwrite its contents with the dataset contents.
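The overwrite decision can be sketched in isolation. This is an illustration of the documented rule, not clearml internals; the function name `copy_into` is this example's:

```python
# Sketch of the mutable-copy rule: refuse to write into a non-empty
# target folder unless overwrite is explicitly requested.
import os
import shutil
import tempfile

def copy_into(target: str, overwrite: bool = False) -> bool:
    if os.path.isdir(target) and os.listdir(target):
        if not overwrite:
            return False          # keep existing contents, do nothing
        shutil.rmtree(target)     # overwrite: drop existing contents
    os.makedirs(target, exist_ok=True)
    # a real client would now copy the dataset files into `target`
    return True

tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "existing.txt"), "w").close()
refused = copy_into(tmp)                  # folder non-empty, overwrite not set
allowed = copy_into(tmp, overwrite=True)  # contents replaced
```

Unlike the cached copy, the resulting folder is yours to modify freely.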
Dataset.create()
Create a new dataset.
Parent datasets can be specified, and the new dataset inherits all of their content. Multiple dataset parents can be listed. Parent datasets are merged based on the list's order, where each parent can override overlapping files from the previous parent dataset.
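The merge-order rule above can be sketched as an ordered dict merge. A conceptual illustration, not clearml's implementation:

```python
# Sketch of the documented merge rule: parents are merged in list order,
# and later parents override overlapping files from earlier ones.
def merge_parents(parents):
    """Each parent maps a relative file path -> a content version tag."""
    merged = {}
    for parent in parents:       # list order matters
        merged.update(parent)    # later parents win on overlapping paths
    return merged

parent_a = {"train.csv": "v1", "labels.json": "v1"}
parent_b = {"train.csv": "v2", "extra.csv": "v1"}
merged = merge_parents([parent_a, parent_b])
# train.csv comes from parent_b, since it appears later in the list
```

Reversing the parent list would flip which version of train.csv survives, so the order in which parents are passed is significant.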
Dataset.add_files()
Add files or folder into the current dataset.
Dataset.upload()
Starts uploading the files. The function returns when all files have been uploaded.
Dataset.finalize()
Closes the dataset and marks it as Completed. After a dataset has been closed, it can no longer be modified. Before closing a dataset, its files must first be uploaded.
Parameters
| Name | Description | Optional |
|---|---|---|
| verbose | If True, print verbose progress report | |
| raise_on_error | If True, raise an exception if dataset finalizing failed | |