Refactor ClearML Data docs (#108)
docs/clearml_data/best_practices.md (new file, 46 lines)

---
title: Best Practices
---

The following are some recommendations for using ClearML Data.

## Versioning Datasets

Use ClearML Data to version your datasets. Once a dataset is finalized, it can no longer be modified. This makes it clear
which version of the dataset was used with which task, enabling accurate reproduction of your experiments.

When you need to change a dataset's contents, create a new version of the dataset and specify the previous version as its
parent. The new version inherits the parent's contents, and you can then update only the files that have changed.
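
The following is a minimal sketch of this workflow using the ClearML SDK; the project and dataset names are placeholders:

```python
from clearml import Dataset

# Create a new version that inherits the contents of an existing (finalized) dataset
parent = Dataset.get(dataset_project="data-project", dataset_name="images")
new_version = Dataset.create(
    dataset_name="images",
    dataset_project="data-project",
    parent_datasets=[parent.id],
)

# Add or override only the files that changed, then freeze the new version
new_version.add_files(path="./new_samples")
new_version.upload()
new_version.finalize()
```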

## Organize Datasets for Easier Access

Organize your datasets according to use case, and use tags. This makes managing multiple datasets and
accessing the most up-to-date dataset for each use case easier.

Like any ClearML task, datasets can be organized into [projects (and sub-projects)](../fundamentals/projects.md#creating-sub-projects).
Additionally, tags can be applied to a dataset when it is created, which makes searching for the dataset easier.

Organizing your datasets into projects by use case makes it easier to access the most recent dataset version for that use case.
If only a project is specified when using [`Dataset.get`](../references/sdk/dataset.md#datasetget), the method returns the
most recent dataset in that project. The same is true with tags: if a tag is specified, the method returns the most recent dataset labeled with that tag.
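
For example, a minimal sketch of fetching the latest dataset version by project and tag (the project name and tag here are placeholders):

```python
from clearml import Dataset

# Returns the most recent finalized dataset in the project that carries the given tag
latest = Dataset.get(
    dataset_project="data-project",
    dataset_tags=["production"],
    only_completed=True,
)
local_copy = latest.get_local_copy()  # cached, read-only copy of the dataset files
```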

## Document Your Datasets

Attach informative metrics or debug samples to the dataset itself. Use the [`get_logger`](../references/sdk/dataset.md#get_logger)
method to access the dataset's logger object, then add any additional information to the dataset using the methods
available with a [logger](../references/sdk/logger.md) object.

You can add dataset summaries (like [table reporting](../references/sdk/logger.md#report_table)) to create a preview
of the stored data for better visibility, or attach any statistics generated by the data ingestion process.
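
As a rough sketch, a preview table could be attached while a dataset version is being built (the project, dataset, and table contents below are made up):

```python
import pandas as pd
from clearml import Dataset

# Typically done before the dataset version is finalized
dataset = Dataset.create(dataset_name="images", dataset_project="data-project")
dataset.add_files(path="./new_samples")

# Attach a small summary table as a preview of the dataset's contents
preview = pd.DataFrame({"class": ["cat", "dog"], "count": [1200, 980]})
dataset.get_logger().report_table(
    title="Dataset Preview",
    series="Class distribution",
    iteration=0,
    table_plot=preview,
)

dataset.upload()
dataset.finalize()
```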

## Periodically Update Your Dataset

Your data probably changes from time to time. If the data is updated into the same local / network folder structure, which
serves as the dataset's single point of truth, you can schedule a script that uses the dataset `sync` functionality to
update the dataset based on the modifications made to that folder. This way, there is no need to modify the dataset manually.
The sync functionality also tracks which modifications were made to the folder.

See the sync functionality in the [CLI](clearml_data_cli.md#syncing-local-storage) or [SDK](clearml_data_sdk.md#syncing-local-storage)
interface.
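
A minimal sketch of such a scheduled script using the SDK (the folder path and names are placeholders):

```python
from clearml import Dataset

# Create a new version on top of the latest one and sync it with the source folder
parent = Dataset.get(dataset_project="data-project", dataset_name="images")
new_version = Dataset.create(
    dataset_name="images",
    dataset_project="data-project",
    parent_datasets=[parent.id],
)

# Add, modify, and remove files according to the current state of the folder
new_version.sync_folder(local_path="/mnt/shared/images")

new_version.upload()
new_version.finalize()
```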