---
title: Best Practices
---

:::important
This page covers `clearml-data`, ClearML's file-based data management solution.
See [Hyper-Datasets](../hyperdatasets/overview.md) for ClearML's advanced queryable dataset management solution.
:::

The following are some recommendations for using ClearML Data. 

![Dataset UI gif](../img/gif/dataset.gif)

## Versioning Datasets

Use ClearML Data to version your datasets. Once a dataset is finalized, it can no longer be modified. This makes clear
which version of the dataset was used with which task, enabling the accurate reproduction of your tasks. 

When you need to change a dataset's contents, create a new version of the dataset and specify the previous version as 
its parent. The new version inherits the parent's contents, which you can then add to, modify, or remove before 
finalizing the new version. 
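
The parent/child workflow above can be sketched roughly as follows. This is a minimal example that assumes the ClearML 
SDK is installed and configured; the helper name `create_child_version`, the default dataset and project names, and the 
folder path are placeholders for illustration only.

```python
def create_child_version(parent_id, new_files_dir,
                         name="my_dataset", project="my_project"):
    """Create a new dataset version that inherits from a finalized parent.

    `name`, `project`, and `new_files_dir` are hypothetical placeholders.
    """
    # Imported inside the helper so it can be defined without a configured ClearML setup
    from clearml import Dataset

    # The child version starts out with the parent's contents
    child = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=[parent_id],
    )
    # Add new or changed files on top of the inherited contents
    child.add_files(path=new_files_dir)
    # Upload the changes, then lock the version so it can no longer be modified
    child.upload()
    child.finalize()
    return child.id
```

Calling `create_child_version("<parent_dataset_id>", "path/to/new_files")` would return the ID of the new version, which 
can in turn serve as the parent of a later version.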

## Organize Datasets for Easier Access

Organize your datasets by use case and label them with tags. This makes it easier to manage multiple datasets and 
to access the most up-to-date dataset for each use case. 

Like ClearML tasks, datasets can be organized into [projects (and subprojects)](../fundamentals/projects.md#creating-subprojects). 
Additionally, you can apply tags when creating a dataset, which makes the dataset easier to search for.

Organizing your datasets into projects by use case makes it easier to access the most recent dataset version for that use case. 
If only a project is specified when using [`Dataset.get()`](../references/sdk/dataset.md#datasetget), the method returns the 
most recent dataset in that project. The same is true for tags: if a tag is specified, the method returns the most recent 
dataset labeled with that tag.
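
As a rough illustration of the lookup described above, the following sketch wraps `Dataset.get()` so that a dataset is 
selected by project, by tag, or by both. The helper name `latest_dataset` and the project and tag values in the usage 
note are hypothetical.

```python
def latest_dataset(project=None, tag=None):
    """Fetch a dataset by project and/or tag instead of by explicit ID."""
    if project is None and tag is None:
        raise ValueError("Specify a project, a tag, or both")

    # Imported inside the helper so it can be defined without a configured ClearML setup
    from clearml import Dataset

    # With no dataset ID given, the lookup relies only on the project and tag filters
    return Dataset.get(
        dataset_project=project,
        dataset_tags=[tag] if tag else None,
    )
```

For example, `latest_dataset(project="my_project")` or `latest_dataset(tag="production")` would each resolve to a 
single dataset without hard-coding its ID in the script.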

When a task consumes a dataset, you can track which dataset the task uses through `Dataset.get()`'s `alias` parameter. 
Pass `alias=<dataset_alias_string>`, and the task using the dataset stores the dataset's ID in the 
`dataset_alias_string` parameter under the task's **CONFIGURATION > HYPERPARAMETERS > Datasets** section.
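
A short sketch of the `alias` usage described above. It assumes the code runs as part of a ClearML task, so that there 
is a task to record the alias on; the helper name `get_tracked_dataset_path` and the alias value `training_data` are 
placeholders.

```python
def get_tracked_dataset_path(project, alias="training_data"):
    """Fetch a dataset with an alias and return a local copy of its files."""
    # Imported inside the helper so it can be defined without a configured ClearML setup
    from clearml import Dataset

    # The alias is the key under which the dataset's ID is recorded on the task
    dataset = Dataset.get(dataset_project=project, alias=alias)
    # Download (or reuse a cached copy of) the dataset's files
    return dataset.get_local_copy()
```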


## Document Your Datasets

Attach informative metrics or debug samples to the dataset itself. Use [`Dataset.get_logger()`](../references/sdk/dataset.md#get_logger)
to access the dataset's logger object, then add any additional information to the dataset using the methods
available on a [`Logger`](../references/sdk/logger.md) object. 

For example, add dataset summaries (such as [table reporting](../references/sdk/logger.md#report_table)) to provide a 
preview of the stored data, or attach statistics generated by the data ingestion process. 
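
One possible way to attach such a summary is sketched below. The helpers `extension_summary` and `report_summary`, the 
table title, and the series name are illustrative choices, and `dataset` is assumed to be a ClearML `Dataset` object 
that has not been finalized yet. The table is passed to `report_table` as a list of rows, which avoids a pandas dependency.

```python
from collections import Counter
from pathlib import Path


def extension_summary(folder):
    """Return table rows (header row first) counting files per extension."""
    counts = Counter(
        p.suffix or "<none>" for p in Path(folder).rglob("*") if p.is_file()
    )
    return [["extension", "files"]] + [[ext, n] for ext, n in sorted(counts.items())]


def report_summary(dataset, folder):
    """Attach the per-extension file counts to the dataset as a table."""
    rows = extension_summary(folder)
    # get_logger() returns the Logger attached to the dataset itself
    dataset.get_logger().report_table(
        title="Dataset summary",
        series="Files per extension",
        table_plot=rows,
    )
    return rows
```

The same pattern applies to other `Logger` methods, for example reporting statistics computed during data ingestion.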


## Periodically Update Your Dataset 

Your data probably changes from time to time. If updates land in the same local or network folder structure, and that 
folder serves as the dataset's single source of truth, you can schedule a script that uses the dataset `sync` 
functionality to update the dataset based on the modifications made to the folder. This removes the need to modify the 
dataset manually, and the modifications made to the folder are tracked as well.

See the `sync` functionality in the [CLI](clearml_data_cli.md#sync) or [SDK](clearml_data_sdk.md#syncing-local-storage)
interface.
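
A script scheduled for this purpose might look roughly like the sketch below, which uses the SDK's 
`Dataset.sync_folder()`. The helper name `sync_dataset_version` and its arguments are hypothetical, and the scheduling 
itself (for example a cron job) is left out.

```python
def sync_dataset_version(folder, name, project, parent_id=None):
    """Create a dataset version whose contents mirror a local or network folder."""
    # Imported inside the helper so it can be defined without a configured ClearML setup
    from clearml import Dataset

    # Optionally build on a previous version so only the differences are recorded
    dataset = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=[parent_id] if parent_id else None,
    )
    # Update the dataset's file list to match the folder's current state
    dataset.sync_folder(local_path=folder)
    dataset.upload()
    dataset.finalize()
    return dataset.id
```

Passing the ID returned by one run as `parent_id` to the next run keeps each scheduled update as a separate, 
reproducible version.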