clearml-docs/clearml_data.md at 40be99822e37f3d4c13c304b90ea33095cd36d30

mirror of https://github.com/clearml/clearml-docs synced 2025-01-31 14:37:18 +00:00

allegroai 83c20f3d54 Fix icons

2021-05-23 23:17:12 +03:00

ClearML Data

In Machine Learning, you are very likely dealing with a gargantuan amount of data that you need to put in a dataset, which you then need to be able to share, reproduce, and track.

ClearML Data Management solves two important challenges:

Accessibility - Making data easily accessible from every machine,
Versioning - Linking data and experiments for better traceability.

We believe data is not code. It should not be stored in a git tree, because progress on datasets is not always linear. Moreover, it can be difficult and inefficient to find the commit in a git tree that corresponds to a certain version of a dataset.

A clearml-data dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage). Datasets can be set up to inherit from other datasets, so data lineages can be created, and users can track when and how their data changes.
Dataset changes are stored differentially, meaning a version stores only the change-set relative to its parent datasets.

Local copies of datasets are always cached, so the same data never needs to be downloaded twice. When a dataset is pulled, all of its parent datasets are pulled automatically and merged into one output folder for you to work with.
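The change-set idea can be sketched in a few lines of Python. This is only an illustrative model under simplifying assumptions, not ClearML's actual storage format: the `Version` class, its fields, and the `resolve` function are hypothetical names, each version has a single parent, and files are represented by made-up content hashes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Version:
    """Hypothetical dataset version that stores only its change-set."""
    parent: Optional["Version"] = None
    # relative path -> content hash, for files added or modified in this version
    changed: Dict[str, str] = field(default_factory=dict)
    # relative paths deleted in this version
    removed: Set[str] = field(default_factory=set)


def resolve(version: Version) -> Dict[str, str]:
    """Rebuild the full file listing by replaying change-sets, oldest first."""
    chain = []
    node = version
    while node is not None:
        chain.append(node)
        node = node.parent

    files: Dict[str, str] = {}
    for node in reversed(chain):  # start from the root ancestor
        for path in node.removed:
            files.pop(path, None)
        files.update(node.changed)
    return files


v1 = Version(changed={"a.csv": "h1", "b.csv": "h2"})
v2 = Version(parent=v1, changed={"b.csv": "h3", "c.csv": "h4"}, removed={"a.csv"})
print(resolve(v2))  # {'b.csv': 'h3', 'c.csv': 'h4'}
```

In this model, `v2` stores two entries and one removal rather than a full copy of the data, yet the complete listing can still be rebuilt on demand.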

ClearML Data offers two interfaces:

clearml-data - CLI utility for creating, uploading, and managing datasets.
clearml.Dataset - A python interface for creating, retrieving, managing, and using datasets.

Creating a Dataset

Using the clearml-data CLI, users can create datasets using the following commands:

clearml-data create --project dataset_example --name initial_version
clearml-data add --files data_folder

The commands will do the following:

Start a Data Processing Task called "initial_version" in the "dataset_example" project
The CLI will return a unique ID for the dataset
All the files from the "data_folder" folder will be added to the dataset and uploaded by default to the ClearML server.

:::note
clearml-data is stateful and remembers the last created dataset, so there's no need to specify a dataset ID unless we want to work on another dataset.
:::

Using a Dataset

Now in our python code, we can access and use the created dataset from anywhere:

from clearml import Dataset

local_path = Dataset.get(dataset_id='dataset_id_from_previous_command').get_local_copy()

All our files are now available under local_path, in the same folder structure. It is that simple!

The next step is to set the dataset_id as a parameter for our code and voilà! We can now train on any dataset we have in the system.
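One way to turn the dataset ID into a parameter is a command line argument. The sketch below is only an example: the `--dataset-id` flag and the `build_parser` / `fetch_dataset` helper names are assumptions made for illustration, while `Dataset.get(...).get_local_copy()` is the call shown above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Train on a ClearML dataset")
    # --dataset-id is a hypothetical flag name chosen for this example
    parser.add_argument("--dataset-id", required=True, help="ID of the dataset to train on")
    return parser


def fetch_dataset(dataset_id: str) -> str:
    """Return the local folder that holds a cached copy of the dataset."""
    # Imported lazily so the argument parsing works even where clearml is not installed
    from clearml import Dataset

    return Dataset.get(dataset_id=dataset_id).get_local_copy()


# Parse an explicit argument list here; a real script would call parse_args()
# with no arguments and then pass args.dataset_id to fetch_dataset().
args = build_parser().parse_args(["--dataset-id", "dataset_id_from_previous_command"])
print(args.dataset_id)
```

Switching to another dataset then only requires a different argument value, with no change to the training code.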

Setup

clearml-data comes built-in with our clearml python package! Just check out the getting started guide for more info!

Usage

CLI

It's possible to manage datasets (create / modify / upload / delete) with the clearml-data command line tool.

Creating a Dataset

clearml-data create --project <project_name> --name <dataset_name> --parents <existing_dataset_id>

Creates a new dataset.

Parameters

Name	Description	Optional
name	Dataset's name
project	Dataset's project
parents	IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered
tags	Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets

:::important
clearml-data works in a stateful mode, so once a new dataset is created, the following commands do not require the --id flag.
:::

Add Files to Dataset

clearml-data add --id <dataset_id> --files <filenames/folders_to_add>

It's possible to add individual files or complete folders.

Parameters

Name	Description	Optional
id	Dataset's ID. Default: previously created / accessed dataset
files	Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`
dataset-folder	Dataset base folder to add the files to in the dataset. Default: dataset root
non-recursive	Disable recursive scan of files
verbose	Verbose reporting

Remove Files From Dataset

clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>

Parameters

Name	Description	Optional
id	Dataset's ID. Default: previously created / accessed dataset
files	Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: file path is the path within the dataset, not the local path.
non-recursive	Disable recursive scan of files
verbose	Verbose reporting

Finalize Dataset

clearml-data close --id <dataset_id>

Finalizes the dataset and makes it ready to be consumed. It automatically uploads all files that were not previously uploaded. Once a dataset is finalized, it can no longer be modified.

Parameters

Name	Description	Optional
id	Dataset's ID. Default: previously created / accessed dataset
storage	Remote storage to use for the dataset files. Default: files_server
disable-upload	Disable automatic upload when closing the dataset
verbose	Verbose reporting

Upload Dataset Content

clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]

Uploads added files to ClearML Server by default. It's possible to specify a different storage medium by entering an upload destination, such as s3://bucket, gs://, azure://, /mnt/shared/.

Parameters

Name	Description	Optional
id	Dataset's ID. Default: previously created / accessed dataset
storage	Remote storage to use for the dataset files. Default: files_server
verbose	Verbose reporting

Sync Local Folder

clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']

This option syncs a folder's content with ClearML. It is useful in case a user has a single point of truth (i.e. a folder) which updates from time to time.

When an update should be reflected in ClearML, users can call clearml-data sync, create a new dataset, and specify the folder. All changes (file additions, modifications, and removals) will then be reflected in ClearML.

This command also uploads the data and finalizes the dataset automatically.

Parameters

Name	Description	Optional
id	Dataset's ID. Default: previously created / accessed dataset
folder	Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`
storage	Remote storage to use for the dataset files. Default: files_server
parents	IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset
project	If creating a new dataset, specify the dataset's project name
name	If creating a new dataset, specify the dataset's name
tags	Dataset user tags
skip-close	Do not auto close dataset after syncing folders
verbose	Verbose reporting

List Dataset Content

clearml-data list [--id <dataset_id>]

Parameters

Name	Description	Optional
id	Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset
project	Specify dataset project name (if used instead of ID, dataset name is also required)
name	Specify dataset name (if used instead of ID, dataset project is also required)
filter	Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`
modified	Only list file changes (add / remove / modify) introduced in this version

Delete a Dataset

clearml-data delete [--id <dataset_id_to_delete>]

Deletes an entire dataset from ClearML. This can also be used to delete a newly created dataset.

This does not work on datasets with children unless the force option is used.

Parameters

Name	Description	Optional
id	ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet
force	Force dataset deletion even if other dataset versions depend on it

Search for a Dataset

clearml-data search [--name <name>] [--project <project_name>] [--tags <tag>]

Lists all datasets in the system that match the search request.

Datasets can be searched by project, name, ID, and tags.

Parameters

Name	Description	Optional
ids	A list of dataset IDs
project	The project name of the datasets
name	A dataset name or a partial name to filter datasets by
tags	A list of dataset user tags

Python API

All API commands are accessed through the Dataset class, which should be imported with:
from clearml import Dataset

`Dataset.get(dataset_id=DS_ID).get_local_copy()`

Returns the path to the dataset in the local cache, downloading the dataset first if it is not already cached.

Parameters

Name	Description	Optional
use_soft_links	If True, use soft links. Default: False on Windows, True on Posix systems
raise_on_error	If True, raise exception if dataset merging failed on any file

`Dataset.get(dataset_id=DS_ID).get_mutable_local_copy()`

Downloads the dataset to a specific folder (non-cached). If the folder already has contents, specify whether to overwrite its contents with the dataset contents.

Parameters

Name	Description	Optional
target_folder	Local target folder for the writable copy of the dataset
overwrite	If True, recursively delete the contents of the target folder before creating a copy of the dataset. If False (default) and target folder contains files, raise exception or return None
raise_on_error	If True, raise exception if dataset merging failed on any file
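The choice between the two calls depends on whether the files will be modified. A minimal sketch, assuming `dataset` is an object returned by `Dataset.get(...)`; the `materialize` helper is a hypothetical name.

```python
from typing import Optional


def materialize(dataset, target_folder: Optional[str] = None) -> str:
    """Return a local path holding the files of the given dataset.

    dataset is expected to be a clearml Dataset, for example the result of
    Dataset.get(dataset_id=...).
    """
    if target_folder is None:
        # Read-only use: reuse the shared cached copy
        return dataset.get_local_copy()
    # Writable use: private copy in target_folder. With overwrite=True, any
    # existing contents of target_folder are deleted first.
    return dataset.get_mutable_local_copy(target_folder=target_folder, overwrite=True)
```

The cached copy is shared, so code that edits files in place should request the mutable copy instead.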

`Dataset.create()`

Create a new dataset.

Parent datasets can be specified, and the new dataset inherits all of its parents' content. Multiple dataset parents can be listed. Parent datasets are merged in the order they are listed, so each parent can override overlapping files from the parents listed before it.

Parameters

Name	Description	Optional
dataset_name	Name of the new dataset
dataset_project	The project containing the dataset. If not specified, infer project name from parent datasets. If there is no parent dataset, then this value is required
parent_datasets	Expand a parent dataset by adding / removing files
use_current_task	If True, the dataset is created on the current Task. Default: False
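The effect of parent order can be illustrated with plain dictionaries. This is a simplified model rather than ClearML code: each parent is represented as a mapping from file path to a made-up version label, and `merge_parents` is a hypothetical helper.

```python
from typing import Dict, Iterable


def merge_parents(parents: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Merge parent file listings in order; later parents override earlier ones."""
    merged: Dict[str, str] = {}
    for listing in parents:
        merged.update(listing)
    return merged


base = {"train.csv": "base", "labels.json": "base"}
fix = {"labels.json": "fix"}

print(merge_parents([base, fix]))  # {'train.csv': 'base', 'labels.json': 'fix'}
print(merge_parents([fix, base]))  # {'labels.json': 'base', 'train.csv': 'base'}
```

Listing `fix` last keeps its version of `labels.json`, while listing it first lets `base` override it.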

`Dataset.add_files()`

Add files or a folder to the current dataset.

Parameters

Name	Description	Optional
path	Add a folder / file to the dataset
wildcard	Add only a specific set of files based on wildcard matching. Wildcard matching can be a single string or a list of wildcards, for example: `~/data/*.jpg`, `~/data/json`
local_base_folder	Files will be located based on their relative path from local_base_folder
dataset_path	Where in the dataset the folder / files should be located
recursive	If True, match all wildcard files recursively
verbose	If True, print to console files added / modified

`Dataset.upload()`

Starts uploading the files. The function returns when all files are uploaded.

Parameters

Name	Description	Optional
show_progress	If True, show upload progress bar
verbose	If True, print verbose progress report
output_url	Target storage for the compressed dataset (default: file server). Examples: `s3://bucket/data`, `gs://bucket/data` , `azure://bucket/data`, `/mnt/share/data`
compression	Compression algorithm for the Zipped dataset file (default: ZIP_DEFLATED)

`Dataset.finalize()`

Closes the dataset and marks it as Completed. After a dataset has been closed, it can no longer be modified. Before closing a dataset, its files must first be uploaded.

Parameters

Name	Description	Optional
verbose	If True, print verbose progress report
raise_on_error	If True, raise exception if dataset finalizing failed
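Put together, the calls above form a create, add, upload, finalize sequence. The following is a minimal sketch of that sequence, assuming a configured ClearML environment. The `publish_folder` helper and its `dataset_cls` argument are hypothetical, the latter added only so the flow can be exercised with a stand-in class; the `Dataset` methods and parameters are the ones documented above.

```python
from typing import Optional, Sequence


def publish_folder(
    folder: str,
    name: str,
    project: str,
    parent_ids: Optional[Sequence[str]] = None,
    output_url: Optional[str] = None,
    dataset_cls=None,
) -> str:
    """Create a dataset from a local folder, upload it, finalize it, and return its ID."""
    if dataset_cls is None:
        # Imported lazily so the module loads even where clearml is not installed
        from clearml import Dataset as dataset_cls

    dataset = dataset_cls.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=parent_ids,
    )
    dataset.add_files(path=folder)
    # Files must be uploaded before the dataset can be finalized
    dataset.upload(output_url=output_url)
    dataset.finalize()
    return dataset.id
```

A call such as `publish_folder("data_folder", "initial_version", "dataset_example")` would mirror the CLI example at the top of this page.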
