---
title: SDK
---
:::important
This page covers clearml-data, ClearML's file-based data management solution.
See Hyper-Datasets for ClearML's advanced queryable dataset management solution.
:::
Datasets can be created, modified, and managed with ClearML Data's Python interface. You can upload your dataset to any
storage service of your choice (S3 / GS / Azure / Network Storage) by setting the dataset's upload destination (see the
output_url parameter of Dataset.upload()). Once you have uploaded your dataset, you can access it from any machine.
The following page provides an overview of the most basic methods of the Dataset class. See the Dataset reference page
for a complete list of available methods.
Import the Dataset
class, and let's get started!
from clearml import Dataset
## Creating Datasets
ClearML Data provides multiple ways to create datasets programmatically, supporting a variety of use cases:
- Dataset.create() - Create a new dataset. Parent datasets can be specified, from which the new dataset will inherit its data
- Dataset.squash() - Generate a new dataset by squashing together a set of related datasets
You can add metadata to your datasets using Dataset.set_metadata(), and access the metadata using Dataset.get_metadata().
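For example, the following is a minimal sketch (the project name, dataset name, metadata values, and metadata name are placeholders, not part of the original example) of attaching a metadata object to a dataset and reading it back:

```python
from clearml import Dataset

# Placeholder project/name: fetch an existing dataset
dataset = Dataset.get(dataset_project="Example Project", dataset_name="Example Dataset")

# Attach a metadata object under an explicit name (values here are illustrative)
dataset.set_metadata(
    metadata={"num_samples": 10000, "source": "sensor-a"},
    metadata_name="ingest stats"
)

# Retrieve the metadata by the same name
print(dataset.get_metadata(metadata_name="ingest stats"))
```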
### Dataset.create()
Use the Dataset.create
class method to create a dataset.
Creating datasets programmatically is especially helpful when preprocessing the data, so that the preprocessing code
and the resulting dataset are saved in a single task (see the use_current_task parameter in Dataset.create).
# Preprocessing code here
dataset = Dataset.create(
    dataset_name='dataset name',
    dataset_project='dataset project',
    parent_datasets=[PARENT_DS_ID_1, PARENT_DS_ID_2],
    dataset_version="1.0",
    output_uri="gs://bucket-name/folder",
    description='my dataset description'
)
:::tip Locating Dataset ID
For datasets created with clearml v1.6 or newer on ClearML Server v1.6 or newer, find the ID in the dataset version's info panel in the Dataset UI.
For datasets created with earlier versions of clearml, or if using an earlier version of ClearML Server, find the ID in the task header of the dataset task's info panel.
:::
:::info Dataset Version
Input the dataset's version using the semantic versioning scheme (e.g. 1.0.1, 2.0). If a version is not specified,
the method tries to find the latest dataset version with the specified dataset_name and dataset_project, and
auto-increments the version number.
:::
Use the output_uri parameter to specify a network storage target to which the dataset files and associated information
(such as previews) are uploaded. For example:
- A shared folder:
/mnt/share/folder
- S3:
s3://bucket/folder
- Non-AWS S3-like services (e.g. MinIO):
s3://host_addr:port/bucket
- Google Cloud Storage:
gs://bucket-name/folder
- Azure Storage:
azure://<account name>.blob.core.windows.net/path/to/file
By default, the dataset uploads to ClearML's file server. The output_uri
parameter of Dataset.upload()
overrides this parameter's value.
The created dataset inherits the content of the parent_datasets. When multiple dataset parents are listed,
they are merged in order of specification. Each parent overrides any overlapping files from a previous parent dataset.
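As a hedged sketch of this behavior (the IDs and path below are placeholders), a child version starts from its parents' merged content, so only the new or changed files need to be added:

```python
# Later parents override overlapping files from earlier parents
child = Dataset.create(
    dataset_name="dataset name",
    dataset_project="dataset project",
    parent_datasets=[PARENT_DS_ID_1, PARENT_DS_ID_2]
)

# Only the delta needs to be added and uploaded
child.add_files(path="path/to/new_or_changed_files")
child.upload()
child.finalize()
```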
### Dataset.squash()
To improve deep dataset DAG storage and speed, dataset squashing was introduced. The Dataset.squash
class method generates a new dataset by squashing a set of dataset versions, and merging down all changes introduced in
their lineage DAG, creating a new, flat, independent version.
The datasets being squashed into a single dataset can be specified by their IDs or by project and name pairs.
# option 1 - list dataset IDs
squashed_dataset_1 = Dataset.squash(
    dataset_name='squashed dataset\'s name',
    dataset_ids=[DS1_ID, DS2_ID, DS3_ID]
)

# option 2 - list project and dataset pairs
squashed_dataset_2 = Dataset.squash(
    dataset_name='squashed dataset 2',
    dataset_project_name_pairs=[
        ('dataset1 project', 'dataset1 name'),
        ('dataset2 project', 'dataset2 name')
    ]
)
In addition, the target storage location for the squashed dataset can be specified using the output_uri parameter of Dataset.squash().
## Accessing Datasets
Once a dataset has been created and uploaded to a server, the dataset can be accessed programmatically from anywhere.
Use the Dataset.get class method to access a specific Dataset object by providing any of the following dataset
attributes: dataset ID, project, name, tags, and/or version. If multiple datasets match the query, the most recent
one is returned.
dataset = Dataset.get(
    dataset_id=None,
    dataset_project="Example Project",
    dataset_name="Example Dataset",
    dataset_tags="my tag",
    dataset_version="1.2",
    only_completed=True,
    only_published=False,
)
Pass auto_create=True, and a dataset will be created on-the-fly with the input attributes (project name, dataset name,
and tags) if no datasets match the query.
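For example, a minimal sketch (the project and dataset names are placeholders): if no dataset matches, a new empty dataset with these attributes is created and returned:

```python
dataset = Dataset.get(
    dataset_project="Example Project",
    dataset_name="Example Dataset",
    auto_create=True
)
```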
In cases where you use a dataset in a task (e.g. consuming a dataset), you can have its ID stored in the task's
hyperparameters: pass alias=<dataset_alias_string>, and the task using the dataset will store the dataset's ID in the
dataset_alias_string parameter under the Datasets hyperparameters section. This way you can easily track which
dataset the task is using. If you use alias with overridable=True, you can override the dataset ID from the UI's
CONFIGURATION > HYPERPARAMETERS > Datasets section, allowing you to change the dataset used when running a task
remotely.
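The following is a minimal sketch of this usage (the project, dataset name, and alias are placeholders); when called inside a task, the dataset's ID is logged under the task's Datasets hyperparameters section using the given alias:

```python
dataset = Dataset.get(
    dataset_project="Example Project",
    dataset_name="Example Dataset",
    alias="my_dataset",   # logged under the task's Datasets hyperparameters section
    overridable=True      # allow overriding the dataset ID from the UI on remote runs
)
local_path = dataset.get_local_copy()
```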
In case you want to get a modifiable dataset, you can get a newly created mutable dataset with the current one as its
parent by passing writable_copy=True.
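For example, a hedged sketch (names and paths are placeholders): fetch a writable child of the matched dataset, add a file, and publish the new version:

```python
dataset = Dataset.get(
    dataset_project="Example Project",
    dataset_name="Example Dataset",
    writable_copy=True
)

# The returned dataset is a new, mutable version whose parent is the matched dataset
dataset.add_files(path="path/to/new_file")
dataset.upload()
dataset.finalize()
```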
Once a specific dataset object has been obtained, get a local copy of the dataset using one of the following options:
- Dataset.get_local_copy() - get a read-only local copy of an entire dataset. This method returns a path to the dataset in local cache (downloading the dataset if it is not already in cache).
- Dataset.get_mutable_local_copy() - get a writable local copy of an entire dataset. This method downloads the dataset to a specific folder (non-cached), specified with the target_folder parameter. If the specified folder already has contents, specify whether to overwrite its contents with the dataset contents, using the overwrite parameter.
ClearML supports parallel downloading of datasets. Use the max_workers parameter of the Dataset.get_local_copy or
Dataset.get_mutable_local_copy methods to specify the number of threads to use when downloading the dataset. By default,
it's the number of your machine's logical cores.
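For example, a minimal sketch (the target folder and worker count are illustrative):

```python
dataset = Dataset.get(dataset_project="Example Project", dataset_name="Example Dataset")

# Read-only cached copy (downloaded only if not already in the local cache)
cached_path = dataset.get_local_copy(max_workers=8)

# Writable, non-cached copy in a folder you control
mutable_path = dataset.get_mutable_local_copy(
    target_folder="./my_dataset_copy",
    overwrite=True,   # replace any existing contents of the target folder
    max_workers=8
)
```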
## Modifying Datasets
Once a dataset has been created, its contents can be modified and replaced. When your data changes, you can add updated files or remove unnecessary ones.
### add_files()
To add local files or folders into the current dataset, use the Dataset.add_files
method.
If a file is already in a dataset, but it has been modified, it can be added again, and ClearML will upload the file diff.
dataset = Dataset.create(dataset_name="my dataset", dataset_project="example project")
dataset.add_files(path="path/to/folder_or_file")
You can add a set of files based on wildcard matching of a single string or a list of strings, using the
wildcard
parameter. Specify whether to match the wildcard files recursively using the recursive
parameter.
For example:
dataset.add_files(
    path="path/to/folder",
    wildcard="~/data/*.jpg",
    recursive=True
)
### add_external_files()
To add files or folders to the current dataset, leaving them in their original location, use the Dataset.add_external_files
method. Input the source_url argument, which can be a link or a list of links from cloud storage (s3://, gs://, azure://)
or local / network storage (file://).
dataset = Dataset.create(dataset_name="my dataset", dataset_project="example project")

dataset.add_external_files(
    source_url="s3://my/bucket/path_to_folder_or_file",
    dataset_path="/my_dataset/new_folder/"
)
dataset.add_external_files(
    source_url=[
        "s3://my/bucket/path_to_folder_or_file",
        "s3://my/bucket/path_to_another_folder_or_file",
    ],
    dataset_path="/my_dataset/new_folder/"
)
You can add a set of files based on wildcard matching of a single string or a list of wildcards using the
wildcard
parameter. Specify whether to match the wildcard files recursively using the recursive
parameter.
# Add all jpg files located under s3://my/bucket/ to the dataset:
dataset.add_external_files(
    source_url="s3://my/bucket/",
    wildcard="*.jpg",
    dataset_path="/my_dataset/new_folder/"
)
### remove_files()
To remove files from a current dataset, use the Dataset.remove_files
method.
Input the path to the folder or file to be removed in the dataset_path parameter. The path is relative to the dataset.
To remove links, specify their URL (e.g. s3://bucket/file).
You can also input a wildcard into dataset_path in order to remove a set of files matching the wildcard.
Set the recursive parameter to True in order to match all wildcard files recursively.
For example:
dataset.remove_files(dataset_path="*.csv", recursive=True)
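Similarly, a hedged sketch (the paths below are placeholders) of removing a single file by its dataset-relative path, or an external link by its URL:

```python
# Remove one file by its path relative to the dataset root
dataset.remove_files(dataset_path="data/file_to_remove.csv")

# Remove an external link by its URL
dataset.remove_files(dataset_path="s3://my/bucket/path_to_folder_or_file")
```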
## Dataset Preview
Add informative metrics, plots, or media to the Dataset. Use Dataset.get_logger()
to access the dataset's logger object, then add any additional information to the dataset, using the methods
available with a logger object.
You can add dataset summaries (like a table preview) for better visibility into the stored data, or attach any statistics generated by the data ingestion process.
For example:
# Attach a table to the dataset
dataset.get_logger().report_table(
    title="Raw Dataset Metadata", series="Raw Dataset Metadata", csv="path/to/csv"
)

# Attach a histogram to the dataset
dataset.get_logger().report_histogram(
    title="Class distribution",
    series="Class distribution",
    values=histogram_data,
    iteration=0,
    xlabels=histogram_data.index.tolist(),
    yaxis="Number of samples",
)
## Uploading Files
To upload the dataset files to network storage, use the Dataset.upload
method.
Use the output_url parameter to specify the storage target, such as S3 / GS / Azure. For example:
- A shared folder:
/mnt/share/folder
- S3:
s3://bucket/folder
- Non-AWS S3-like services (e.g. MinIO):
s3://host_addr:port/bucket
- Google Cloud Storage:
gs://bucket-name/folder
- Azure Storage:
azure://<account name>.blob.core.windows.net/path/to/file
By default, the dataset uploads to ClearML's file server. The storage target specified here overrides the output_uri
value set in the Dataset.create method.
ClearML supports parallel uploading of datasets. Use the max_workers
parameter to specify the number of threads to use
when uploading the dataset. By default, it's the number of your machine's logical cores.
Dataset files must be uploaded before a dataset is finalized.
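For example, a minimal sketch of an upload call (the bucket path and worker count are illustrative):

```python
dataset.upload(
    output_url="gs://bucket-name/folder",  # overrides the output_uri set in Dataset.create
    max_workers=8
)
```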
## Finalizing a Dataset
Use Dataset.finalize()
to close the current dataset. This marks the
dataset task as Completed, at which point the dataset can no longer be modified.
Before closing a dataset, its files must first be uploaded.
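A minimal sketch of the typical closing sequence, assuming dataset is an open dataset object:

```python
# Upload the files, then close the dataset version
dataset.upload()
dataset.finalize()
```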
## Syncing Local Storage
Use Dataset.sync_folder() in order to update a dataset according to a specific folder's content changes. Specify the
folder to sync with the local_path parameter (the method considers all files in the folder, recursively).
This method is useful in the case where there's a single point of truth, either a local or network folder, that gets updated periodically. The folder changes will be reflected in a new dataset version. This method saves time since you don't have to manually update (add / remove) files in a dataset.
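A hedged sketch of this flow (the names, parent ID, and folder path are placeholders): create a new version on top of the previous one, sync it against the folder, then upload and finalize:

```python
# New version whose parent is the previous dataset version
dataset = Dataset.create(
    dataset_name="Example Dataset",
    dataset_project="Example Project",
    parent_datasets=[PREVIOUS_VERSION_ID]
)

# Add/remove files so the dataset matches the folder's current content
dataset.sync_folder(local_path="/mnt/share/single_point_of_truth")

dataset.upload()
dataset.finalize()
```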
## Deleting Datasets
Delete a dataset using the Dataset.delete() method. Input any of the
attributes of the dataset(s) you want to delete, including ID, project name, version, and/or dataset name. Multiple
datasets matching the query will raise an exception, unless you pass entire_dataset=True
and force=True
. In this
case, all matching datasets will be deleted.
If a dataset is a parent of other dataset(s), you must pass force=True in order to delete it.
:::caution
Deleting a parent dataset may cause child datasets to lose data!
:::
Dataset.delete(
    dataset_id=None,
    dataset_project="example project",
    dataset_name="example dataset",
    force=False,
    dataset_version="3.0",
    entire_dataset=False
)
## Renaming Datasets
Rename a dataset using the Dataset.rename
class method. All the datasets
with the given dataset_project
and dataset_name
will be renamed.
Dataset.rename(
    new_dataset_name="New name",
    dataset_project="Example project",
    dataset_name="Example dataset",
)
## Moving Datasets to Another Project
Move a dataset to another project using the Dataset.move_to_project
class method. All the datasets with the given dataset_project
and dataset_name
will be moved to the new dataset
project.
Dataset.move_to_project(
    new_dataset_project="New project",
    dataset_project="Example project",
    dataset_name="Example dataset",
)
## Offline Mode
You can work with datasets in Offline Mode, in which all the data and logs are stored in a local session folder, which can later be uploaded to the ClearML Server.
You can enable offline mode in one of the following ways:
- Before creating a dataset, use Dataset.set_offline() and set the offline_mode argument to True:

      from clearml import Dataset

      # Use the set_offline class method before creating a Dataset
      Dataset.set_offline(offline_mode=True)

      # Create a dataset
      dataset = Dataset.create(dataset_name="Dataset example", dataset_project="Example project")

      # add files to dataset
      dataset.add_files(path='my_image.jpg')

- Before creating a dataset, set CLEARML_OFFLINE_MODE=1
All the dataset's information is zipped and saved locally.
The dataset task's console output displays the task's ID and a path to the local dataset folder:
ClearML Task: created new task id=offline-372657bb04444c25a31bc6af86552cc9
...
...
ClearML Task: Offline session stored in /home/user/.clearml/cache/offline/b786845decb14eecadf2be24affc7418.zip
Note that in offline mode, any methods that require communicating with the server have no effect (e.g. squash(),
finalize(), get_local_copy(), get(), move_to_project(), etc.).
Upload the offline dataset to the ClearML Server using Dataset.import_offline_session().
In the session_folder_zip argument, insert the path to the zip file containing the dataset. To upload the dataset's
data to network storage, set upload to True. To finalize the dataset, which will close it and prevent further
modifications, set finalize to True.
Dataset.import_offline_session(session_folder_zip="<path_to_offline_dataset>", upload=True, finalize=True)