---
title: Best Practices
---
:::important
This page covers `clearml-data`, ClearML's file-based data management solution.
See [Hyper-Datasets](../hyperdatasets/overview.md) for ClearML's advanced queryable dataset management solution.
:::

The following are some recommendations for using ClearML Data.

![Dataset UI gif](../img/gif/dataset.gif)

## Versioning Datasets
Use ClearML Data to version your datasets. Once a dataset is finalized, it can no longer be modified. This makes clear
which version of the dataset was used with which task, enabling the accurate reproduction of your tasks.

When you need to change a dataset's contents, create a new version of the dataset by specifying the previous
dataset as a parent. The new version inherits the previous version's contents, which you can then update
as needed.
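
The parent-based versioning described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the helper name `create_child_version` and its arguments are hypothetical, and it assumes a configured ClearML environment and an existing, finalized parent dataset.

```python
def create_child_version(parent_id, new_files_dir, name, project):
    """Create a new dataset version that inherits the contents of `parent_id`.

    Hypothetical helper for illustration only.
    """
    # Imported inside the function so the helper can be defined
    # even where ClearML is not yet configured.
    from clearml import Dataset

    child = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=[parent_id],  # inherit the parent's contents
    )
    child.add_files(path=new_files_dir)  # add new or changed files
    child.upload()    # upload the new files to storage
    child.finalize()  # lock this version; it can no longer be modified
    return child.id
```

Finalizing the child version at the end keeps it immutable, so any task that consumes it can be reproduced later.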
## Organize Datasets for Easier Access
Organize the datasets according to use-cases and use tags. This makes managing multiple datasets and
accessing the most updated datasets for different use-cases easier.

Like any ClearML task, datasets can be organized into [projects (and subprojects)](../fundamentals/projects.md#creating-subprojects).
Additionally, when creating a dataset, you can apply tags to it, which makes searching for the dataset easier.
Organizing your datasets into projects by use-case makes it easier to access the most recent dataset version for that use-case.
If only a project is specified when using [`Dataset.get()`](../references/sdk/dataset.md#datasetget), the method returns the
most recent dataset in that project. The same is true of tags: if a tag is specified, the method returns the most recent dataset labeled with that tag.
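
As a rough sketch of the lookup described above, the helper below fetches the most recent dataset by project and, optionally, by tags. The function name `get_latest_dataset` and the example project and tag values are hypothetical, and a configured ClearML environment is assumed.

```python
def get_latest_dataset(project, tags=None):
    """Return the most recent dataset in `project`, optionally filtered by tags.

    Hypothetical helper for illustration only.
    """
    # Imported inside the function so the helper can be defined
    # even where ClearML is not yet configured.
    from clearml import Dataset

    # With no dataset name or ID given, the lookup relies on the
    # project (and tags, if any) alone.
    return Dataset.get(dataset_project=project, dataset_tags=tags)
```

For example, `get_latest_dataset("vision/street-signs", tags=["validated"])` would request the latest dataset carrying the `validated` tag in that (hypothetical) project.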

When a task uses a dataset (e.g. consumes it), you can track which dataset the task is using with
`Dataset.get()`'s `alias` parameter. Pass `alias=<dataset_alias_string>`, and the task using the dataset
will store the dataset's ID in the `dataset_alias_string` parameter under the task's **CONFIGURATION > HYPERPARAMETERS >
Datasets** section.
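
A minimal sketch of the `alias` usage follows. The helper name `get_training_data` and the default alias `training_data` are hypothetical, and the sketch assumes the calling script is already running as a ClearML task, so that there is a task to record the dataset ID on.

```python
def get_training_data(project, name, alias="training_data"):
    """Fetch a dataset under an alias and return the path to a local copy.

    Hypothetical helper for illustration only.
    """
    # Imported inside the function so the helper can be defined
    # even where ClearML is not yet configured.
    from clearml import Dataset

    dataset = Dataset.get(
        dataset_project=project,
        dataset_name=name,
        alias=alias,  # the dataset ID is recorded on the task under this name
    )
    # Read-only local copy of the dataset's files
    return dataset.get_local_copy()
```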
## Document your Datasets
Attach informative metrics or debug samples to the Dataset itself. Use [`Dataset.get_logger()`](../references/sdk/dataset.md#get_logger)
to access the dataset's logger object, then add any additional information to the dataset, using the methods
available with a [`Logger`](../references/sdk/logger.md) object.

You can add dataset summaries (such as [table reporting](../references/sdk/logger.md#report_table)) to provide a preview
of the stored data, or attach any statistics generated by the data ingestion process.
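
One possible way to attach such a preview is sketched below. The helper name `report_preview`, the title and series strings, and the row limit are all hypothetical. The sketch also assumes that `report_table` accepts a list of rows for `table_plot` and treats the first row as the header; check the `Logger` reference before relying on that.

```python
def report_preview(dataset, header, rows, max_rows=5):
    """Attach a small preview table to a dataset through its logger.

    Hypothetical helper; `dataset` is expected to be a ClearML Dataset object.
    """
    # First row is the header, followed by at most `max_rows` data rows
    table = [list(header)] + [list(row) for row in rows[:max_rows]]

    logger = dataset.get_logger()
    logger.report_table(
        title="Dataset preview",  # hypothetical title
        series="first rows",      # hypothetical series name
        iteration=0,
        table_plot=table,
    )
    return table
```

Keeping the preview to a handful of rows gives readers a quick look at the data without duplicating the dataset in the report.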
## Periodically Update Your Dataset
Your data probably changes from time to time. If the data is updated in the same local or network folder structure, which
serves as the dataset's single point of truth, you can schedule a script that uses the dataset `sync` functionality to
update the dataset based on the modifications made to the folder. This way, there is no need to modify the dataset manually.
This functionality also tracks the modifications made to the folder.

See the sync function with the [CLI](clearml_data_cli.md#sync) or [SDK](clearml_data_sdk.md#syncing-local-storage)
interface.
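
A script of the kind described above might look roughly like this. The helper name `sync_dataset_version` and its arguments are hypothetical, a configured ClearML environment is assumed, and how the script is scheduled (cron, a ClearML scheduler, or anything else) is left open.

```python
def sync_dataset_version(parent_id, folder, name, project):
    """Create a new dataset version that mirrors the current state of `folder`.

    Hypothetical helper for illustration only.
    """
    # Imported inside the function so the helper can be defined
    # even where ClearML is not yet configured.
    from clearml import Dataset

    dataset = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=[parent_id],  # start from the previous version
    )
    # Add, update, and remove files so the dataset matches the folder
    dataset.sync_folder(local_path=folder)
    dataset.upload()
    dataset.finalize()
    return dataset.id
```

Running the helper on a schedule yields one new version per run, each recording the changes made to the folder since its parent.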