# Managing Your Data
Data is one of the biggest factors determining the success of a project. Associating a model's data with the model's configuration, code, and results (such as accuracy) is key to deriving meaningful insights into model behavior.
ClearML Data lets you:
- Version your data
- Fetch your data from every machine with minimal code changes
- Use the data with any other task
- Associate data with task results
ClearML offers the following data management solutions:
- `clearml.Dataset` - A Python interface for creating, retrieving, managing, and using datasets. See the SDK reference for an overview of the basic methods of the `Dataset` module.
- `clearml-data` - A CLI utility for creating, uploading, and managing datasets. See the CLI reference for the `clearml-data` commands.
- Hyper-Datasets - ClearML's advanced, queryable dataset management solution. For more information, see Hyper-Datasets.
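As a quick taste of the Python interface, here is a minimal sketch of creating and uploading a dataset entirely from code; the project, dataset name, and folder path are illustrative. The guide below performs the equivalent with the CLI:

```python
from clearml import Dataset

# register a new dataset version (project and name are illustrative)
dataset = Dataset.create(dataset_project='chatbot_data', dataset_name='dataset_v1')

# stage a local folder's files, push them to the storage backend,
# and finalize to make the version immutable
dataset.add_files(path='./work_dataset')
dataset.upload()
dataset.finalize()
```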
The following guide uses both the `clearml-data` CLI and the `Dataset` class to do the following:
- Create a ClearML dataset
- Access the dataset from a ClearML Task in order to preprocess the data
- Create a new version of the dataset with the modified data
- Use the new version of the dataset to train a model
## Creating a Dataset
Let's assume you have some code that extracts data from a production database into a local folder. Your goal is to create an immutable copy of the data for subsequent steps to use.
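For concreteness, here is a purely hypothetical version of that extraction step; the database file, table, and column names are made up, and any code that fills `./work_dataset` works just as well:

```python
import csv
import sqlite3
from pathlib import Path

# hypothetical extraction step: dump a table from a production database
# into a local folder ('production.db' and 'dialogues' are stand-ins)
output_dir = Path('./work_dataset')
output_dir.mkdir(exist_ok=True)

conn = sqlite3.connect('production.db')
rows = conn.execute('SELECT id, text FROM dialogues').fetchall()
conn.close()

with open(output_dir / 'dialogues.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'text'])
    writer.writerows(rows)
```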
1. Create the dataset using the `clearml-data create` command, passing the dataset's project and name. You can add a `latest` tag, making it easier to find the dataset later:

   ```bash
   clearml-data create --project chatbot_data --name dataset_v1 --tags latest
   ```
2. Add data to the dataset using `clearml-data sync`, passing the path of the folder to be added. This command also uploads the data and finalizes the dataset automatically:

   ```bash
   clearml-data sync --folder ./work_dataset
   ```
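To sanity-check what was uploaded, you can list the dataset's contents with the `clearml-data list` command (the placeholder stands for the ID printed by `clearml-data create`; subcommand availability may vary with your `clearml` version):

```bash
# show the files registered in the dataset
clearml-data list --id <dataset_id>
```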
## Preprocessing Data
The second step is to preprocess the data. First access the data, then modify it, and lastly create a new version of the data.
1. Create a task for your data preprocessing (optional):

   ```python
   from clearml import Task, Dataset

   # create a task for the data processing
   task = Task.init(project_name='data', task_name='create', task_type='data_processing')
   ```
2. Access the dataset using `Dataset.get()`:

   ```python
   # get the v1 dataset
   dataset = Dataset.get(dataset_project='data', dataset_name='dataset_v1')
   ```
3. Get a local mutable copy of the dataset using `Dataset.get_mutable_local_copy()`. This downloads the dataset to a specified `target_folder` (non-cached). If the folder already has contents, use the `overwrite` parameter to specify whether to overwrite them with the dataset contents:

   ```python
   # get a local mutable copy of the dataset
   dataset_folder = dataset.get_mutable_local_copy(
       target_folder='work_dataset',
       overwrite=True
   )
   ```
4. Preprocess the data, for example by modifying some files in the `./work_dataset` folder (a hypothetical sketch appears after these steps).
5. Create a new version of the dataset:

   ```python
   # create a new version of the dataset
   new_dataset = Dataset.create(
       dataset_project='data',
       dataset_name='dataset_v2',
       parent_datasets=[dataset],
       # this will make sure we have the creation code and the actual dataset artifacts on the same Task
       use_current_task=True,
   )
   ```
6. Add the modified data to the new dataset, then upload and finalize it:

   ```python
   # sync the modified local folder into the new version
   new_dataset.sync_folder(local_path=dataset_folder)
   # upload the changed files and close the version
   new_dataset.upload()
   new_dataset.finalize()
   ```
7. Remove the `latest` tag from the previous dataset and add it to the new dataset:

   ```python
   # remove the previous dataset version's tag
   dataset.tags = []
   # mark the new version as the latest
   new_dataset.tags = ['latest']
   ```
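Step 4 above was left open-ended on purpose. Purely as a hypothetical illustration (the CSV layout and the transformation are made up), preprocessing the working copy could look like:

```python
from pathlib import Path

# hypothetical transformation: lowercase the text of every CSV row
# in the mutable working copy returned by get_mutable_local_copy()
for csv_path in Path(dataset_folder).glob('*.csv'):
    lines = csv_path.read_text().splitlines()
    if not lines:
        continue
    header, rows = lines[0], lines[1:]
    csv_path.write_text('\n'.join([header] + [row.lower() for row in rows]) + '\n')
```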
The new dataset inherits the contents of the datasets specified in `Dataset.create()`'s `parent_datasets` argument. This not only lets you trace dataset changes with full genealogy, but also makes storage more efficient, since each version stores only the files that were changed or added relative to its parents. When you access the dataset, the files from all parent versions are merged automatically and transparently, as if they had always been part of the requested dataset.
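One way to see this merged view in practice is `Dataset.list_files()`, which returns the file paths registered in a dataset version:

```python
# the child version reports the full merged file list, even though
# it physically stores only its own changed/added files
print(new_dataset.list_files())
```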
## Training
You can now train your model with the latest dataset in the system by getting the `Dataset` instance based on the `latest` tag (if two datasets have the same tag, you will get the newer one). Once you have the dataset, you can request a local copy of the data. All local copy requests are cached, so accessing the same dataset multiple times incurs no unnecessary downloads.
```python
from clearml import Task, Dataset

# create a task for the model training
task = Task.init(project_name='data', task_name='ingest', task_type='training')

# get the latest dataset with the tag `latest`
dataset = Dataset.get(dataset_tags='latest')

# get a cached (read-only) copy of the Dataset files
dataset_folder = dataset.get_local_copy()

# train model here
```