From 07a887a48186c9caccc7c33ea442f82940f36c2e Mon Sep 17 00:00:00 2001
From: revital
Date: Sun, 23 Feb 2025 09:29:36 +0200
Subject: [PATCH] Getting Started Refactor part 2

---
 docs/getting_started/auto_log_exp.md    |   2 +-
 docs/getting_started/data_management.md | 131 ++++++++++++++++++
 .../logging_using_artifacts.md          |   6 +-
 docs/getting_started/reproduce_tasks.md |   2 +-
 docs/getting_started/track_tasks.md     |   2 +-
 sidebars.js                             |   5 +-
 6 files changed, 139 insertions(+), 9 deletions(-)
 create mode 100644 docs/getting_started/data_management.md

diff --git a/docs/getting_started/auto_log_exp.md b/docs/getting_started/auto_log_exp.md
index e554bebf..2f3b44d8 100644
--- a/docs/getting_started/auto_log_exp.md
+++ b/docs/getting_started/auto_log_exp.md
@@ -1,5 +1,5 @@
 ---
-title: Auto-log Experiments
+title: Auto-logging Experiments
 ---
 
 In ClearML, experiments are organized as [Tasks](../fundamentals/task.md).
diff --git a/docs/getting_started/data_management.md b/docs/getting_started/data_management.md
new file mode 100644
index 00000000..3064a51f
--- /dev/null
+++ b/docs/getting_started/data_management.md
@@ -0,0 +1,131 @@
+---
+title: Managing Your Data
+---
+
+Data is probably one of the biggest factors that determine the success of a project. Associating a model's data with
+the model's configuration, code, and results (such as accuracy) is key to deducing meaningful insights into model behavior.
+
+[ClearML Data](../clearml_data/clearml_data.md) lets you:
+* Version your data
+* Fetch your data from every machine with minimal code changes
+* Use the data with any other task
+* Associate data with task results.
+
+ClearML offers the following data management solutions:
+
+* `clearml.Dataset` - A Python interface for creating, retrieving, managing, and using datasets. See [SDK](../clearml_data/clearml_data_sdk.md)
+  for an overview of the basic methods of the Dataset module.
+* `clearml-data` - A CLI utility for creating, uploading, and managing datasets.
See [CLI](../clearml_data/clearml_data_cli.md)
+  for a reference of `clearml-data` commands.
+* Hyper-Datasets - ClearML's advanced queryable dataset management solution. For more information, see [Hyper-Datasets](../hyperdatasets/overview.md).
+
+The following guide will use both the `clearml-data` CLI and the `Dataset` class to do the following:
+1. Create a ClearML dataset
+2. Access the dataset from a ClearML Task in order to preprocess the data
+3. Create a new version of the dataset with the modified data
+4. Use the new version of the dataset to train a model
+
+## Creating a Dataset
+
+Let's assume you have some code that extracts data from a production database into a local folder.
+Your goal is to create an immutable copy of the data to be used by further steps.
+
+1. Create the dataset using the `clearml-data create` command and passing the dataset's project and name. You can add a
+   `latest` tag, making it easier to find later.
+
+   ```bash
+   clearml-data create --project chatbot_data --name dataset_v1 --latest
+   ```
+
+1. Add data to the dataset using `clearml-data sync` and passing the path of the folder to be added to the dataset.
+   This command also uploads the data and finalizes the dataset automatically.
+
+   ```bash
+   clearml-data sync --folder ./work_dataset
+   ```
+
+
+## Preprocessing Data
+The second step is to preprocess the data. First access the data, then modify it,
+and lastly create a new version of the data.
+
+1. Create a task for your data preprocessing (optional):
+
+   ```python
+   from clearml import Task, Dataset
+
+   # create a task for the data processing
+   task = Task.init(project_name='data', task_name='create', task_type='data_processing')
+   ```
+
+1. Access a dataset using [`Dataset.get()`](../references/sdk/dataset.md#datasetget):
+
+   ```python
+   # get the v1 dataset
+   dataset = Dataset.get(dataset_project='data', dataset_name='dataset_v1')
+   ```
+1.
Get a local mutable copy of the dataset using [`Dataset.get_mutable_local_copy`](../references/sdk/dataset.md#get_mutable_local_copy). \
+   This downloads the dataset to a specified `target_folder` (non-cached). If the folder already has contents, specify
+   whether to overwrite its contents with the dataset contents using the `overwrite` parameter.
+
+   ```python
+   # get a local mutable copy of the dataset
+   dataset_folder = dataset.get_mutable_local_copy(
+       target_folder='work_dataset',
+       overwrite=True
+   )
+   ```
+
+1. Preprocess the data, including modifying some files in the `./work_dataset` folder.
+
+1. Create a new version of the dataset:
+
+   ```python
+   # create a new version of the dataset with the modified data
+   new_dataset = Dataset.create(
+       dataset_project='data',
+       dataset_name='dataset_v2',
+       parent_datasets=[dataset],
+       # this will make sure we have the creation code and the actual dataset artifacts on the same Task
+       use_current_task=True,
+   )
+   ```
+
+1. Add the modified data to the dataset:
+
+   ```python
+   new_dataset.sync_folder(local_path=dataset_folder)
+   new_dataset.upload()
+   new_dataset.finalize()
+   ```
+
+1. Remove the `latest` tag from the previous dataset and add the tag to the new dataset:
+   ```python
+   # now let's remove the previous dataset tag
+   dataset.tags = []
+   new_dataset.tags = ['latest']
+   ```
+
+The new dataset inherits the contents of the datasets specified in `Dataset.create`'s `parent_datasets` argument.
+This not only helps trace back dataset changes with full genealogy, but also makes the storage more efficient,
+since it only stores the changed and/or added files from the parent versions.
+When you access the dataset, the files from all parent versions are merged transparently,
+as if they were always part of the requested dataset.
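The delta storage described above can be pictured as a simple overlay: a child version records only its changed or added files, and resolving the child merges that delta onto the parent's file listing. A minimal plain-Python sketch of the idea (illustrative only; `resolve_version` and the hash values are hypothetical, not part of the ClearML API or its actual storage code):

```python
def resolve_version(parent_files, child_changes):
    """Overlay a child version's changed/added files on the parent's
    file listing - conceptually how parent versions are merged."""
    resolved = dict(parent_files)   # inherit every file from the parent
    resolved.update(child_changes)  # changed/added files take precedence
    return resolved

# dataset_v1 listing (file name -> content hash); hashes are made up
v1 = {"train.csv": "a1f3", "labels.csv": "9c2e"}
# dataset_v2 stores only its delta: train.csv changed, extra.csv added
v2_delta = {"train.csv": "77b0", "extra.csv": "0d4c"}

print(resolve_version(v1, v2_delta))
# {'train.csv': '77b0', 'labels.csv': '9c2e', 'extra.csv': '0d4c'}
```

Only the delta (`v2_delta` here) needs to be stored for the child version, while retrieval still yields the full merged listing; this is why long parent chains stay storage-efficient.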
+## Training
+You can now train your model with the **latest** dataset in the system by getting the Dataset instance
+based on the `latest` tag (if two datasets share the same tag, you will get the newest one).
+Once you have the dataset, you can request a local copy of the data. All local copy requests are cached,
+so accessing the same dataset multiple times will not trigger any unnecessary downloads.
+
+```python
+# create a task for the model training
+task = Task.init(project_name='data', task_name='ingest', task_type='training')
+
+# get the latest dataset with the tag `latest`
+dataset = Dataset.get(dataset_tags='latest')
+
+# get a cached copy of the Dataset files
+dataset_folder = dataset.get_local_copy()
+
+# train model here
+```
\ No newline at end of file
diff --git a/docs/getting_started/logging_using_artifacts.md b/docs/getting_started/logging_using_artifacts.md
index 3df11210..27cafc98 100644
--- a/docs/getting_started/logging_using_artifacts.md
+++ b/docs/getting_started/logging_using_artifacts.md
@@ -75,7 +75,7 @@ Calling `get()` gets a deserialized pickled object.
 Check out the [artifacts retrieval](https://github.com/clearml/clearml/blob/master/examples/reporting/artifacts_retrieval.py)
 example code.
 
-### Models
+## Models
 
 Models are a special kind of artifact.
 Models created by popular frameworks (such as PyTorch, TensorFlow, Scikit-learn) are automatically logged by ClearML.
@@ -101,7 +101,7 @@ Check out model snapshots examples for [TensorFlow](https://github.com/clearml/c
 [Keras](https://github.com/clearml/clearml/blob/master/examples/frameworks/keras/keras_tensorboard.py),
 [scikit-learn](https://github.com/clearml/clearml/blob/master/examples/frameworks/scikit-learn/sklearn_joblib_example.py).
 
-#### Loading Models
+### Loading Models
 
 Loading a previously trained model is quite similar to loading artifacts.
 ```python
@@ -111,9 +111,11 @@ local_weights_path = last_snapshot.get_local_copy()
 ```
 
 Like before, you have to get the instance of the task that trained the original weights files; then you can query
 the task for its output models (a list of snapshots) and get the latest snapshot.
+
 :::note
 Using TensorFlow, the snapshots are stored in a folder, meaning the `local_weights_path` will point to a folder
 containing your requested snapshot.
 :::
+
 As with artifacts, all models are cached, meaning the next time you run this code, no model needs to be downloaded.
 Once one of the frameworks loads the weights file, the running task is automatically updated with "Input Model"
 pointing directly to the original training Task's Model. This feature lets you easily get a full genealogy of every
 model trained and used by your system!
diff --git a/docs/getting_started/reproduce_tasks.md b/docs/getting_started/reproduce_tasks.md
index f580cc1a..57bb1a98 100644
--- a/docs/getting_started/reproduce_tasks.md
+++ b/docs/getting_started/reproduce_tasks.md
@@ -1,5 +1,5 @@
 ---
-title: Reproduce Tasks
+title: Reproducing Tasks
 ---
 
 :::note
diff --git a/docs/getting_started/track_tasks.md b/docs/getting_started/track_tasks.md
index 2dccb6a9..0b8223f6 100644
--- a/docs/getting_started/track_tasks.md
+++ b/docs/getting_started/track_tasks.md
@@ -1,5 +1,5 @@
 ---
-title: Track Tasks
+title: Tracking Tasks
 ---
 
 Every ClearML [task](../fundamentals/task.md) you create can be found in the **All Tasks** table and in its project's
diff --git a/sidebars.js b/sidebars.js
index 50a41fed..f56c0101 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -61,10 +61,7 @@ module.exports = {
         'getting_started/track_tasks',
         'getting_started/reproduce_tasks',
         'getting_started/logging_using_artifacts',
-/* {'MLOps and LLMOps': [
-
-        'getting_started/mlops/mlops_second_steps',
-        ]}*/
+        'getting_started/data_management',
         'hpo',
         {"Deploying Model Endpoints": [
           {