Getting Started Refactor part 2

This commit is contained in:
revital
2025-02-23 09:29:36 +02:00
parent 4327b43f58
commit 07a887a481
6 changed files with 139 additions and 9 deletions

View File

@@ -1,5 +1,5 @@
---
title: Auto-log Experiments
title: Auto-logging Experiments
---
In ClearML, experiments are organized as [Tasks](../fundamentals/task.md).

View File

@@ -0,0 +1,131 @@
---
title: Managing Your Data
---
Data is often one of the biggest factors determining the success of a project. Associating a model's data with
the model's configuration, code, and results (such as accuracy) is key to gaining meaningful insights into model behavior.
[ClearML Data](../clearml_data/clearml_data.md) lets you:
* Version your data
* Fetch your data from every machine with minimal code changes
* Use the data with any other task
* Associate data with task results
ClearML offers the following data management solutions:
* `clearml.Dataset` - A Python interface for creating, retrieving, managing, and using datasets. See [SDK](../clearml_data/clearml_data_sdk.md)
for an overview of the basic methods of the Dataset module.
* `clearml-data` - A CLI utility for creating, uploading, and managing datasets. See [CLI](../clearml_data/clearml_data_cli.md)
for a reference of `clearml-data` commands.
* Hyper-Datasets - ClearML's advanced queryable dataset management solution. For more information, see [Hyper-Datasets](../hyperdatasets/overview.md).
The following guide will use both the `clearml-data` CLI and the `Dataset` class to do the following:
1. Create a ClearML dataset
2. Access the dataset from a ClearML Task in order to preprocess the data
3. Create a new version of the dataset with the modified data
4. Use the new version of the dataset to train a model
## Creating a Dataset
Let's assume you have some code that extracts data from a production database into a local folder.
Your goal is to create an immutable copy of the data to be used by further steps.
1. Create the dataset using the `clearml-data create` command, passing the dataset's project and name. You can add a
`latest` tag to make the dataset easier to find later.
```bash
clearml-data create --project chatbot_data --name dataset_v1 --tags latest
```
1. Add data to the dataset using `clearml-data sync` and passing the path of the folder to be added to the dataset.
This command also uploads the data and finalizes the dataset automatically.
```bash
clearml-data sync --folder ./work_dataset
```
## Preprocessing Data
The second step is to preprocess the data: first access it, then modify it,
and finally create a new version of the data.
1. Create a task for your data preprocessing (optional):
```python
from clearml import Task, Dataset
# create a task for the data processing
task = Task.init(project_name='data', task_name='create', task_type='data_processing')
```
1. Access a dataset using [`Dataset.get()`](../references/sdk/dataset.md#datasetget):
```python
# get the v1 dataset
dataset = Dataset.get(dataset_project='data', dataset_name='dataset_v1')
```
1. Get a local mutable copy of the dataset using [`Dataset.get_mutable_local_copy`](../references/sdk/dataset.md#get_mutable_local_copy). \
This downloads the dataset to a specified `target_folder` (non-cached). If the folder already has contents, specify
whether to overwrite its contents with the dataset contents using the `overwrite` parameter.
```python
# get a local mutable copy of the dataset
dataset_folder = dataset.get_mutable_local_copy(
target_folder='work_dataset',
overwrite=True
)
```
1. Preprocess the data by modifying some files in the `./work_dataset` folder.
1. Create a new version of the dataset:
```python
# create a new version of the dataset
new_dataset = Dataset.create(
dataset_project='data',
dataset_name='dataset_v2',
parent_datasets=[dataset],
# this will make sure we have the creation code and the actual dataset artifacts on the same Task
use_current_task=True,
)
```
1. Add the modified data to the dataset:
```python
new_dataset.sync_folder(local_path=dataset_folder)
new_dataset.upload()
new_dataset.finalize()
```
1. Remove the `latest` tag from the previous dataset and add the tag to the new dataset:
```python
# now let's remove the previous dataset tag
dataset.tags = []
new_dataset.tags = ['latest']
```
The new dataset inherits the contents of the datasets specified in `Dataset.create`'s `parent_datasets` argument.
This not only helps trace back dataset changes with full genealogy, but also makes the storage more efficient,
since it only stores the changed and/or added files from the parent versions.
When you access the dataset, the files from all parent versions are merged
in a fully transparent process, as if they had always been part of the requested dataset.
## Training
You can now train your model with the **latest** dataset in the system by getting the Dataset instance
based on the `latest` tag (if two datasets share the same tag, the newest one is returned).
Once you have the dataset you can request a local copy of the data. All local copy requests are cached,
which means that if you access the same dataset multiple times you will not have any unnecessary downloads.
```python
# create a task for the model training
task = Task.init(project_name='data', task_name='ingest', task_type='training')
# get the latest dataset with the tag `latest`
dataset = Dataset.get(dataset_tags=['latest'])
# get a cached copy of the Dataset files
dataset_folder = dataset.get_local_copy()
# train model here
```

View File

@@ -75,7 +75,7 @@ Calling `get()` gets a deserialized pickled object.
Check out the [artifacts retrieval](https://github.com/clearml/clearml/blob/master/examples/reporting/artifacts_retrieval.py) example code.
### Models
## Models
Models are a special kind of artifact.
Models created by popular frameworks (such as PyTorch, TensorFlow, Scikit-learn) are automatically logged by ClearML.
@@ -101,7 +101,7 @@ Check out model snapshots examples for [TensorFlow](https://github.com/clearml/c
[Keras](https://github.com/clearml/clearml/blob/master/examples/frameworks/keras/keras_tensorboard.py),
[scikit-learn](https://github.com/clearml/clearml/blob/master/examples/frameworks/scikit-learn/sklearn_joblib_example.py).
#### Loading Models
### Loading Models
Loading a previously trained model is quite similar to loading artifacts.
```python
@@ -111,9 +111,11 @@ local_weights_path = last_snapshot.get_local_copy()
```
Like before, you first get the instance of the task that trained the original weights, then query the task for its output models (a list of snapshots), and take the latest snapshot.
:::note
When using TensorFlow, snapshots are stored in a folder, so `local_weights_path` will point to a folder containing your requested snapshot.
:::
As with artifacts, all models are cached, meaning the next time you run this code, no model needs to be downloaded.
Once one of the frameworks loads the weights file, the running task is automatically updated with an "Input Model" pointing directly to the original training task's model.
This feature lets you easily trace the full genealogy of every model trained and used by your system!

View File

@@ -1,5 +1,5 @@
---
title: Reproduce Tasks
title: Reproducing Tasks
---
:::note

View File

@@ -1,5 +1,5 @@
---
title: Track Tasks
title: Tracking Tasks
---
Every ClearML [task](../fundamentals/task.md) you create can be found in the **All Tasks** table and in its project's

View File

@@ -61,10 +61,7 @@ module.exports = {
'getting_started/track_tasks',
'getting_started/reproduce_tasks',
'getting_started/logging_using_artifacts',
/* {'MLOps and LLMOps': [
'getting_started/mlops/mlops_second_steps',
]}*/
'getting_started/data_management',
'hpo',
{"Deploying Model Endpoints": [
{