---
title: SDK
---
:::important
This page covers `clearml-data`, ClearML's file-based data management solution.
See [Hyper-Datasets](../hyperdatasets/overview.md) for ClearML's advanced queryable dataset management solution.
:::
Datasets can be created, modified, and managed with ClearML Data's python interface. You can upload your dataset to any
storage service of your choice (S3 / GS / Azure / Network Storage) by setting the dataset's upload destination (see the
[`output_url`](#uploading-files) parameter of `Dataset.upload()`). Once you have uploaded your dataset, you can access
it from any machine.
The following page provides an overview for using the most basic methods of the `Dataset` class. See the [Dataset reference page](../references/sdk/dataset.md)
for a complete list of available methods.
Import the `Dataset` class, and let's get started!
```python
from clearml import Dataset
```
## Creating Datasets
ClearML Data supports multiple ways to create datasets programmatically, to suit a variety of use cases:
* [`Dataset.create()`](#datasetcreate) - Create a new dataset. Parent datasets can be specified, from which the new dataset
will inherit its data
* [`Dataset.squash()`](#datasetsquash) - Generate a new dataset by squashing together a set of related datasets
You can add metadata to your datasets using the `Dataset.set_metadata` method, and access the metadata using the
`Dataset.get_metadata` method. See [`set_metadata`](../references/sdk/dataset.md#set_metadata) and [`get_metadata`](../references/sdk/dataset.md#get_metadata).
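For example, here's a minimal sketch of attaching and reading back metadata (the `sample_info` name and the DataFrame contents are illustrative placeholders):
```python
import pandas as pd
from clearml import Dataset

dataset = Dataset.get(dataset_project="Example Project", dataset_name="Example Dataset")

# Attach a table of per-file annotations to the dataset
annotations = pd.DataFrame({"file": ["a.jpg", "b.jpg"], "label": ["cat", "dog"]})
dataset.set_metadata(annotations, metadata_name="sample_info")

# Later, from any machine, read the metadata back
annotations = dataset.get_metadata(metadata_name="sample_info")
```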
### Dataset.create()
Use the [`Dataset.create`](../references/sdk/dataset.md#datasetcreate) class method to create a dataset.
Creating datasets programmatically is especially helpful when preprocessing the data so that the
preprocessing code and the resulting dataset are saved in a single task (see the `use_current_task` parameter in [`Dataset.create`](../references/sdk/dataset.md#datasetcreate)).
```python
# Preprocessing code here
dataset = Dataset.create(
    dataset_name='dataset name',
    dataset_project='dataset project',
    parent_datasets=[PARENT_DS_ID_1, PARENT_DS_ID_2],
    dataset_version="1.0",
    output_uri="gs://bucket-name/folder",
    description='my dataset description'
)
```
:::tip Locating Dataset ID
For datasets created with `clearml` v1.6 or newer on ClearML Server v1.6 or newer, find the ID in the dataset version's info panel in the [Dataset UI](../webapp/datasets/webapp_dataset_viewing.md).
For datasets created with earlier versions of `clearml`, or if using an earlier version of ClearML Server, find the ID in the task header of the [dataset task's info panel](../webapp/webapp_exp_track_visual.md).
:::
:::info Dataset Version
Input the dataset's version using the [semantic versioning](https://semver.org) scheme (e.g. `1.0.1`, `2.0`). If a version
is not input, the method tries finding the latest dataset version with the specified `dataset_name` and `dataset_project`
and auto-increments the version number.
:::
Use the `output_uri` parameter to specify a network storage target to which the dataset files and associated information
(such as previews) are uploaded (e.g. `s3://bucket/data`, `gs://bucket/data`, `azure://bucket/data`, `file:///mnt/share/data`).
By default, the dataset uploads to ClearML's file server. The `output_url` parameter of the [`Dataset.upload`](#uploading-files)
method overrides this parameter's value.
The created dataset inherits the content of the `parent_datasets` . When multiple dataset parents are listed,
they are merged in order of specification. Each parent overrides any overlapping files from a previous parent dataset.
### Dataset.squash()
To improve deep dataset DAG storage and speed, dataset squashing was introduced. The [`Dataset.squash`](../references/sdk/dataset.md#datasetsquash)
class method generates a new dataset by squashing a set of dataset versions, and merging down all changes introduced in
their lineage DAG, creating a new, flat, independent version.
The datasets being squashed into a single dataset can be specified by their IDs or by project & name pairs.
```python
# option 1 - list dataset IDs
squashed_dataset_1 = Dataset.squash(
dataset_name='squashed dataset\'s name',
dataset_ids=[DS1_ID, DS2_ID, DS3_ID]
)
# option 2 - list project and dataset pairs
squashed_dataset_2 = Dataset.squash(
dataset_name='squashed dataset 2',
dataset_project_name_pairs=[('dataset1 project', 'dataset1 name'),
('dataset2 project', 'dataset2 name')]
)
```
In addition, the target storage location for the squashed dataset can be specified using the `output_url` parameter of the
[`Dataset.squash`](../references/sdk/dataset.md#datasetsquash) method.
## Accessing Datasets
Once a dataset has been created and uploaded to a server, the dataset can be accessed programmatically from anywhere.
Use the [`Dataset.get`](../references/sdk/dataset.md#datasetget) class method to access a specific Dataset object, by
providing any of the dataset's following attributes: dataset ID, project, name, tags, and/or version. If multiple
datasets match the query, the most recent one is returned.
```python
dataset = Dataset.get(
dataset_id=None,
dataset_project="Example Project",
dataset_name="Example Dataset",
dataset_tags="my tag",
dataset_version="1.2",
only_completed=True,
only_published=False,
)
```
Pass `auto_create=True`, and a dataset will be created on-the-fly with the input attributes (project name, dataset name,
and tags) if no datasets match the query.
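For example, a minimal sketch (the project and dataset names are placeholders):
```python
# Returns the matching dataset, or creates a new empty one if none exists
dataset = Dataset.get(
    dataset_project="Example Project",
    dataset_name="Example Dataset",
    auto_create=True
)
```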
In cases where you use a dataset in a task (e.g. consuming a dataset), you can have its ID stored in the task's
hyperparameters: pass `alias=<dataset_alias_string>`, and the task using the dataset will store the dataset's ID in the
`dataset_alias_string` parameter under the `Datasets` hyperparameters section. This way you can easily track which
dataset the task is using. If you use `alias` with `overridable=True`, you can override the dataset ID from the UI's
**CONFIGURATION > HYPERPARAMETERS >** `Datasets` section, allowing you to change the dataset used when running a task
remotely.
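For example, a minimal sketch (the dataset ID and alias string are placeholders):
```python
# The consuming task records the dataset ID under its
# CONFIGURATION > HYPERPARAMETERS > Datasets section
dataset = Dataset.get(
    dataset_id="<dataset_id>",
    alias="my_dataset",
    overridable=True
)
```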
In case you want to get a modifiable dataset, you can get a newly created mutable dataset with the current one as its
parent, by passing `writable_copy=True` .
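A minimal sketch:
```python
# Returns a new, writable dataset whose parent is the retrieved dataset
dataset = Dataset.get(
    dataset_project="Example Project",
    dataset_name="Example Dataset",
    writable_copy=True
)
dataset.add_files(path="path/to/new_file")
```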
Once a specific dataset object has been obtained, get a local copy of the dataset using one of the following options:
* [`Dataset.get_local_copy()`](../references/sdk/dataset.md#get_local_copy) - get a read-only local copy of an entire dataset.
This method returns a path to the dataset in local cache (downloading the dataset if it is not already in cache).
* [`Dataset.get_mutable_local_copy()`](../references/sdk/dataset.md#get_mutable_local_copy) - get a writable local copy
of an entire dataset. This method downloads the dataset to a specific folder (non-cached), specified with the `target_folder` parameter. If
the specified folder already has contents, specify whether to overwrite its contents with the dataset contents, using the `overwrite` parameter.
ClearML supports parallel downloading of datasets. Use the `max_workers` parameter of the `Dataset.get_local_copy` or
`Dataset.get_mutable_local_copy` methods to specify the number of threads to use when downloading the dataset. By default, it's
the number of your machine's logical cores.
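For example, a minimal sketch (the target folder is a placeholder):
```python
# Read-only copy in the local cache, downloaded with 8 threads
local_path = dataset.get_local_copy(max_workers=8)

# Writable copy downloaded to a specific (non-cached) folder
mutable_path = dataset.get_mutable_local_copy(
    target_folder="path/to/folder",
    overwrite=True
)
```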
## Modifying Datasets
Once a dataset has been created, its contents can be modified and replaced. When your data is changed, you can
add updated files or remove unnecessary files.
### add_files()
To add local files or folders into the current dataset, use the [`Dataset.add_files`](../references/sdk/dataset.md#add_files)
method.
If a file is already in a dataset, but it has been modified, it can be added again, and ClearML will
upload the file diff.
```python
dataset = Dataset.create(dataset_name="my dataset", dataset_project="example project")
dataset.add_files(path="path/to/folder_or_file")
```
There is an option to add a set of files based on wildcard matching of a single string or a list of strings, using the
`wildcard` parameter. Specify whether to match the wildcard files recursively using the `recursive` parameter.
For example:
```python
dataset.add_files(
path="path/to/folder",
wildcard="~/data/*.jpg",
recursive=True
)
```
### add_external_files()
To add files or folders to the current dataset, leaving them in their original location, use the [`Dataset.add_external_files`](../references/sdk/dataset.md#add_external_files)
method. Input the `source_url` argument, which can be a link or a list of links from cloud storage (`s3://`, `gs://`, `azure://`)
or local / network storage (`file://`).
```python
dataset = Dataset.create(dataset_name="my dataset", dataset_project="example project")
dataset.add_external_files(
    source_url="s3://my/bucket/path_to_folder_or_file",
    dataset_path="/my_dataset/new_folder/"
)
dataset.add_external_files(
    source_url=[
        "s3://my/bucket/path_to_folder_or_file",
        "s3://my/bucket/path_to_another_folder_or_file",
    ],
    dataset_path="/my_dataset/new_folder/"
)
```
There is an option to add a set of files based on wildcard matching of a single string or a list of wildcards, using the
`wildcard` parameter. Specify whether to match the wildcard files recursively using the `recursive` parameter.
```python
# Add all jpg files located in s3 bucket called "my_bucket" to the dataset:
dataset.add_external_files(
    source_url="s3://my/bucket/",
    wildcard="*.jpg",
    dataset_path="/my_dataset/new_folder/"
)
```
### remove_files()
To remove files from a current dataset, use the [`Dataset.remove_files`](../references/sdk/dataset.md#remove_files) method.
Input the path to the folder or file to be removed in the `dataset_path` parameter. The path is relative to the dataset.
To remove links, specify their URL (e.g. `s3://bucket/file`).
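For example, a minimal sketch (the file path and link URL are illustrative):
```python
# Remove a file by its path relative to the dataset root
dataset.remove_files(dataset_path="data/file.csv")

# Remove an external link by its URL
dataset.remove_files(dataset_path="s3://bucket/file")
```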
There is also an option to input a wildcard into `dataset_path` in order to remove a set of files matching the wildcard.
Set the `recursive` parameter to `True` in order to match all wildcard files recursively.
For example:
```python
dataset.remove_files(dataset_path="*.csv", recursive=True)
```
## Uploading Files
To upload the dataset files to network storage, use the [`Dataset.upload`](../references/sdk/dataset.md#upload) method.
Use the `output_url` parameter to specify storage target, such as S3 / GS / Azure (e.g. `s3://bucket/data`, `gs://bucket/data`, `azure://bucket/data`, `/mnt/share/data`).
By default, the dataset uploads to ClearML's file server. This target storage overrides the `output_uri` value of the
[`Dataset.create`](#creating-datasets) method.
ClearML supports parallel uploading of datasets. Use the `max_workers` parameter to specify the number of threads to use
when uploading the dataset. By default, it's the number of your machine's logical cores.
Dataset files must be uploaded before a dataset is [finalized](#finalizing-a-dataset).
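For example, a minimal sketch (the bucket URL is a placeholder):
```python
dataset.upload(output_url="s3://bucket/data", max_workers=8)
```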
## Finalizing a Dataset
Use the [`Dataset.finalize`](../references/sdk/dataset.md#finalize) method to close the current dataset. This marks the
dataset task as *Completed*, at which point the dataset can no longer be modified.
Before closing a dataset, its files must first be [uploaded](#uploading-files).
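A minimal sketch of the usual upload-then-finalize flow:
```python
dataset.upload()
dataset.finalize()
```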
## Syncing Local Storage
Use the [`Dataset.sync_folder`](../references/sdk/dataset.md#sync_folder) method in order to update a dataset according
to a specific folder's content changes. Specify the folder to sync with the `local_path` parameter (the method includes all files in the folder, recursively).
This method is useful in the case where there's a single point of truth, either a local or network folder, that gets updated periodically.
The folder changes will be reflected in a new dataset version. This method saves time since you don't have to manually
update (add / remove) files in a dataset.
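For example, a minimal sketch of syncing a folder into a new dataset version (names, IDs, and paths are placeholders):
```python
# Create a new version with the current dataset as its parent
new_version = Dataset.create(
    dataset_name="Example Dataset",
    dataset_project="Example Project",
    parent_datasets=[dataset.id]
)
# Mirror the folder's current contents into the new version
new_version.sync_folder(local_path="path/to/folder")
new_version.upload()
new_version.finalize()
```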
## Deleting Datasets
Delete a dataset using the [`Dataset.delete`](../references/sdk/dataset.md#datasetdelete) class method. Input any of the
attributes of the dataset(s) you want to delete, including ID, project name, version, and/or dataset name. Multiple
datasets matching the query will raise an exception, unless you pass `entire_dataset=True` and `force=True`. In this
case, all matching datasets will be deleted.
If a dataset is the parent of other datasets, you must pass `force=True` in order to delete it.
:::caution
Deleting a parent dataset may cause child datasets to lose data!
:::
```python
Dataset.delete(
dataset_id=None,
dataset_project="example project",
dataset_name="example dataset",
force=False,
dataset_version="3.0",
entire_dataset=False
)
```
## Renaming Datasets
Rename a dataset using the [`Dataset.rename`](../references/sdk/dataset.md#datasetrename) class method. All the datasets
with the given `dataset_project` and `dataset_name` will be renamed.
```python
Dataset.rename(
new_dataset_name="New name",
dataset_project="Example project",
dataset_name="Example dataset",
)
```
## Moving Datasets to Another Project
Move a dataset to another project using the [`Dataset.move_to_project`](../references/sdk/dataset.md#datasetmove_to_project)
class method. All the datasets with the given `dataset_project` and `dataset_name` will be moved to the new dataset
project.
```python
Dataset.move_to_project(
new_dataset_project="New project",
dataset_project="Example project",
dataset_name="Example dataset",
)
```
## Offline Mode
You can work with datasets in **Offline Mode**, in which all the data and logs are stored in a local session folder,
which can later be uploaded to the [ClearML Server](../deploying_clearml/clearml_server.md).
You can enable offline mode in one of the following ways:
* Before creating a dataset, use [`Dataset.set_offline()`](../references/sdk/dataset.md#datasetset_offline) and set the
`offline_mode` argument to `True`:
```python
from clearml import Dataset
# Use the set_offline class method before creating a Dataset
Dataset.set_offline(offline_mode=True)
# Create a dataset
dataset = Dataset.create(dataset_name="Dataset example", dataset_project="Example project")
# add files to dataset
dataset.add_files(path='my_image.jpg')
```
* Before creating a dataset, set `CLEARML_OFFLINE_MODE=1`
All the dataset's information is zipped and is saved locally.
The dataset task's console output displays the task's ID and a path to the local dataset folder:
```
ClearML Task: created new task id=offline-372657bb04444c25a31bc6af86552cc9
...
...
ClearML Task: Offline session stored in /home/user/.clearml/cache/offline/b786845decb14eecadf2be24affc7418.zip
```
Note that in offline mode, any methods that require communicating with the server have no effect (e.g. `squash()`,
`finalize()`, `get_local_copy()`, `get()`, `move_to_project()`, etc.).
Upload the offline dataset to the ClearML Server using [`Dataset.import_offline_session()`](../references/sdk/dataset.md#datasetimport_offline_session).
```python
Dataset.import_offline_session(session_folder_zip="<path_to_offline_dataset>", upload=True, finalize=True)
```
In the `session_folder_zip` argument, insert the path to the zip folder containing the dataset. To [upload](#uploading-files)
the dataset's data to network storage, set `upload` to `True`. To [finalize](#finalizing-a-dataset) the dataset,
which will close it and prevent further modifications to the dataset, set `finalize` to `True`.