---
title: Managing Your Data
---
Data is probably one of the biggest factors that determines the success of a project. Associating a model's data with
the model's configuration, code, and results (such as accuracy) is key to deriving meaningful insights into model behavior.
[ClearML Data](../clearml_data/clearml_data.md) lets you:
* Version your data
* Fetch your data from every machine with minimal code changes
* Use the data with any other task
* Associate data with task results
ClearML offers the following data management solutions:
* `clearml.Dataset` - A Python interface for creating, retrieving, managing, and using datasets. See [SDK](../clearml_data/clearml_data_sdk.md)
for an overview of the basic methods of the Dataset module.
* `clearml-data` - A CLI utility for creating, uploading, and managing datasets. See [CLI](../clearml_data/clearml_data_cli.md)
for a reference of `clearml-data` commands.
* Hyper-Datasets - ClearML's advanced queryable dataset management solution. For more information, see [Hyper-Datasets](../hyperdatasets/overview.md).
The following guide uses both the `clearml-data` CLI and the `Dataset` class to:
1. Create a ClearML dataset
2. Access the dataset from a ClearML Task in order to preprocess the data
3. Create a new version of the dataset with the modified data
4. Use the new version of the dataset to train a model
## Creating a Dataset
Let's assume you have some code that extracts data from a production database into a local folder.
Your goal is to create an immutable copy of the data to be used by further steps.
1. Create the dataset using the `clearml-data create` command, passing the dataset's project and name. You can add a
`latest` tag to make the dataset easier to find later.
```bash
clearml-data create --project chatbot_data --name dataset_v1 --latest
```
1. Add data to the dataset using `clearml-data sync`, passing the path of the folder to be added to the dataset.
This command also uploads the data and finalizes the dataset automatically (an SDK equivalent of both steps is sketched after this list).
```bash
clearml-data sync --folder ./work_dataset
```
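If you prefer to stay in Python, both steps can also be performed with the `Dataset` SDK. A minimal sketch, assuming the same `./work_dataset` folder:

```python
from clearml import Dataset

# create the dataset version (equivalent to `clearml-data create`)
dataset = Dataset.create(dataset_project='chatbot_data', dataset_name='dataset_v1')
dataset.tags = ['latest']

# add the folder contents, upload them, and close the version
# (equivalent to `clearml-data sync --folder ./work_dataset`)
dataset.sync_folder(local_path='./work_dataset')
dataset.upload()
dataset.finalize()
```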
## Preprocessing Data
The second step is to preprocess the data: first access it, then modify it, and finally create a new version of the data.
1. Create a task for your data preprocessing (optional):
```python
from clearml import Task, Dataset
# create a task for the data processing
task = Task.init(project_name='data', task_name='create', task_type='data_processing')
```
1. Access a dataset using [`Dataset.get()`](../references/sdk/dataset.md#datasetget):
```python
# get the v1 dataset
dataset = Dataset.get(dataset_project='data', dataset_name='dataset_v1')
```
1. Get a local mutable copy of the dataset using [`Dataset.get_mutable_local_copy`](../references/sdk/dataset.md#get_mutable_local_copy). \
This downloads the dataset to a specified `target_folder` (non-cached). If the folder already has contents, specify
whether to overwrite its contents with the dataset contents using the `overwrite` parameter.
```python
# get a local mutable copy of the dataset
dataset_folder = dataset.get_mutable_local_copy(
    target_folder='work_dataset',
    overwrite=True
)
```
1. Preprocess the data, e.g. by modifying some files in the `./work_dataset` folder (a hypothetical example is sketched after this list).
1. Create a new version of the dataset:
```python
# create a new version of the dataset with the modified data
new_dataset = Dataset.create(
    dataset_project='data',
    dataset_name='dataset_v2',
    parent_datasets=[dataset],
    # this will make sure we have the creation code and the actual dataset artifacts on the same Task
    use_current_task=True,
)
```
1. Add the modified data to the dataset:
```python
new_dataset.sync_folder(local_path=dataset_folder)
new_dataset.upload()
new_dataset.finalize()
```
1. Remove the `latest` tag from the previous dataset and add the tag to the new dataset:
```python
# now let's remove the previous dataset tag
dataset.tags = []
new_dataset.tags = ['latest']
```
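As a hypothetical illustration of the preprocessing in step 4 above, assuming the dataset contains plain-text files (the transformation itself is just an example):

```python
from pathlib import Path

# hypothetical preprocessing: lowercase every .txt file in the local working copy
for path in Path(dataset_folder).rglob('*.txt'):
    path.write_text(path.read_text().lower())
```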
The new dataset inherits the contents of the datasets specified in `Dataset.create`'s `parent_datasets` argument.
This not only lets you trace dataset changes back through their full genealogy, but also makes storage more efficient,
since only the changed and/or added files are stored on top of the parent versions.
When you access the dataset, the files from all parent versions are merged transparently,
as if they were always part of the requested dataset.
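One way to see this merged view is to list the files of the new version; `Dataset.list_files()` returns the file listing of a version, inherited files included (a sketch):

```python
# files inherited from dataset_v1 appear alongside the newly added/changed ones
print(new_dataset.list_files())
```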
## Training
You can now train your model with the **latest** dataset in the system by retrieving the Dataset instance through its
`latest` tag (if two datasets carry the same tag, the newest one is returned).
Once you have the dataset, you can request a local copy of the data. All local copy requests are cached,
so accessing the same dataset multiple times does not trigger unnecessary downloads.
```python
# create a task for the model training
task = Task.init(project_name='data', task_name='ingest', task_type='training')
# get the latest dataset with the tag `latest`
dataset = Dataset.get(dataset_tags='latest')
# get a cached copy of the Dataset files
dataset_folder = dataset.get_local_copy()
# train model here
```
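Because copies are cached, a repeated request for the same dataset version is served from the local cache. A quick sketch, assuming the code above has already run:

```python
# served from the local cache; no second download is triggered
dataset_folder_again = dataset.get_local_copy()
```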