Getting Started Refactor part 3

This commit is contained in:
revital
2025-02-23 09:55:57 +02:00
parent 07a887a481
commit d7fc4d922e
3 changed files with 27 additions and 122 deletions


@@ -0,0 +1,25 @@
---
title: Building Pipelines
---
Pipelines streamline and connect multiple processes by plugging the output of one process in as the input of another.
ClearML Pipelines are implemented by a Controller Task that holds the logic of the pipeline steps' interactions. The
execution logic launches each step once its parent steps have completed their execution. Depending on the
specifications laid out in the controller task, a step's parameters can be overridden, enabling users to leverage other
steps' execution products such as artifacts and parameters.
When run, the controller sequentially launches the pipeline steps. Pipelines can be executed locally or
on any machine using the [clearml-agent](../clearml_agent.md).
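
The launch rule described above (a step runs only after its parent steps complete, and can consume their outputs) can be
illustrated with a small plain-Python sketch. This is not the ClearML implementation and does not use the ClearML SDK:
`run_pipeline`, the `steps` mapping, and the step names are hypothetical and exist only for illustration.

```python
def run_pipeline(steps):
    """Toy controller: run each step once all of its parents have completed.

    `steps` maps a step name to a (parents, function) pair. Each function
    receives a dict holding the outputs of its parents and returns its own output.
    """
    outputs = {}           # completed steps and their outputs
    pending = dict(steps)  # steps that have not run yet
    order = []             # the order in which steps were launched
    while pending:
        # a step is ready once every one of its parents has an output
        ready = [name for name, (parents, _) in pending.items()
                 if all(parent in outputs for parent in parents)]
        if not ready:
            raise ValueError("cycle or missing parent among: %s" % sorted(pending))
        for name in ready:
            parents, func = pending.pop(name)
            outputs[name] = func({parent: outputs[parent] for parent in parents})
            order.append(name)
    return order, outputs


# hypothetical three-step pipeline, deliberately declared out of order
steps = {
    "train": (["preprocess"], lambda inputs: "model trained on " + inputs["preprocess"]),
    "preprocess": (["extract"], lambda inputs: inputs["extract"] + " (cleaned)"),
    "extract": ([], lambda inputs: "raw data"),
}
order, outputs = run_pipeline(steps)
print(order)
print(outputs["train"])
```

The declaration order does not matter here: the loop derives the execution order from the `parents` lists alone, which
is the same idea the controller applies to pipeline steps.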
ClearML pipelines are created from code using one of the following:
* [PipelineController class](../pipelines/pipelines_sdk_tasks.md) - A Pythonic interface for defining and configuring the
pipeline controller and its steps. The controller and steps can be functions in your Python code or existing ClearML tasks.
* [PipelineDecorator class](../pipelines/pipelines_sdk_function_decorators.md) - A set of Python decorators which transform
your functions into the pipeline controller and steps.
For more information, see [ClearML Pipelines](../pipelines/pipelines.md).
![Pipeline DAG](../img/webapp_pipeline_DAG.png#light-mode-only)
![Pipeline DAG](../img/webapp_pipeline_DAG_dark.png#dark-mode-only)


@@ -1,121 +0,0 @@
---
title: Next Steps
---
Once Tasks are defined and in the ClearML system, they can be chained together to create Pipelines.
Pipelines provide users with a greater level of abstraction and automation, with Tasks running one after the other.
Tasks can interface with other Tasks in the pipeline and leverage other Tasks' work products.
The sections below describe the following scenarios:
* [Dataset creation](#dataset-creation)
* Data [processing](#preprocessing-data) and [consumption](#training)
* [Pipeline building](#building-the-pipeline)
## Building Tasks
### Dataset Creation
Let's assume you have some code that extracts data from a production database into a local folder.
Your goal is to create an immutable copy of the data to be used by further steps:
```bash
clearml-data create --project data --name dataset
clearml-data sync --folder ./from_production
```
You can add a tag `latest` to the Dataset, marking it as the latest version.
### Preprocessing Data
The second step is to preprocess the data. First access the data, then modify it,
and lastly create a new version of the data.
```python
from clearml import Task, Dataset

# create a task for the data processing part
task = Task.init(project_name='data', task_name='create', task_type='data_processing')

# get the dataset created in the previous step
dataset = Dataset.get(dataset_project='data', dataset_name='dataset')

# get a local mutable copy of the dataset
dataset_folder = dataset.get_mutable_local_copy(
    target_folder='work_dataset',
    overwrite=True
)
# change some files in the `./work_dataset` folder

# create a new version of the dataset with the modified files
new_dataset = Dataset.create(
    dataset_project='data',
    dataset_name='dataset_v2',
    parent_datasets=[dataset],
    # this will make sure we have the creation code and the actual dataset artifacts on the same Task
    use_current_task=True,
)
new_dataset.sync_folder(local_path=dataset_folder)
new_dataset.upload()
new_dataset.finalize()

# now let's remove the previous dataset tag
dataset.tags = []
new_dataset.tags = ['latest']
```
The new dataset inherits the contents of the datasets specified in `Dataset.create`'s `parent_datasets` argument.
This not only helps trace back dataset changes with full genealogy, but also makes the storage more efficient,
since it only stores the changed and/or added files from the parent versions.
When you access the Dataset, it automatically merges the files from all parent versions
in a fully automatic and transparent process, as if the files were always part of the requested Dataset.
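
To make the storage argument concrete, here is a minimal plain-Python sketch of the idea: each version records only its
own added or changed files, and reading a version merges its parents first. This is a toy model under stated
assumptions, not how ClearML stores datasets internally; the `DatasetVersion` class, its fields, and the file names are
hypothetical.

```python
class DatasetVersion:
    """Toy model of a dataset version that stores only its own changes."""

    def __init__(self, name, parents=(), changed=None, removed=()):
        self.name = name
        self.parents = list(parents)
        self.changed = dict(changed or {})  # files added or modified in this version
        self.removed = set(removed)         # files deleted in this version

    def resolve(self):
        """Merge all parent versions, then apply this version's own changes."""
        files = {}
        for parent in self.parents:
            files.update(parent.resolve())
        for path in self.removed:
            files.pop(path, None)
        files.update(self.changed)
        return files


# the first version stores two files
v1 = DatasetVersion("v1", changed={"a.csv": "a-v1", "b.csv": "b-v1"})
# the child version stores only one modified file and one new file
v2 = DatasetVersion("v2", parents=[v1], changed={"b.csv": "b-v2", "c.csv": "c-v1"})

merged = v2.resolve()
print(sorted(merged.items()))
print(len(v2.changed))
```

Although `v2` stores only two files, resolving it yields all three, with `b.csv` taken from the newer version.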
### Training
You can now train your model with the **latest** Dataset in the system by getting the Dataset instance
based on the `latest` tag
(if two Datasets happen to share the same tag, you will get the newest one).
Once you have the dataset you can request a local copy of the data. All local copy requests are cached,
which means that if you access the same dataset multiple times you will not have any unnecessary downloads.
```python
from clearml import Task, Dataset

# create a task for the model training
task = Task.init(project_name='data', task_name='ingest', task_type='training')
# get the latest dataset with the tag `latest`
dataset = Dataset.get(dataset_tags=['latest'])
# get a cached copy of the Dataset files
dataset_folder = dataset.get_local_copy()
# train our model here
```
## Building the Pipeline
Now that you have the data processing step and the training step, create a pipeline that, when executed,
runs the first step and then the second.
Keep in mind that pipelines are themselves Tasks, so they can also be automated by other pipelines (i.e. pipelines of pipelines).
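```python
from clearml import PipelineController

# the pipeline controller is itself a Task in the `data` project
pipe = PipelineController(
    project='data',
    name='pipeline demo',
    version="1.0"
)
# step 1: run a copy of the `create` (data processing) Task
pipe.add_step(
    name='step 1 data',
    base_project_name='data',
    base_task_name='create'
)
# step 2: run a copy of the `ingest` (training) Task once step 1 completes
pipe.add_step(
    name='step 2 train',
    parents=['step 1 data', ],
    base_project_name='data',
    base_task_name='ingest'
)
# launch the pipeline; by default the controller is enqueued on the `services` queue
pipe.start()
```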
You can also pass parameters from one step to another (for example, `Task.id`).
In addition to pipelines made up of Task steps, ClearML also supports pipelines consisting of function steps. For more
information, see the [full pipeline documentation](../../pipelines/pipelines.md).


@@ -62,6 +62,7 @@ module.exports = {
'getting_started/reproduce_tasks',
'getting_started/logging_using_artifacts',
'getting_started/data_management',
'getting_started/building_pipelines',
'hpo',
{"Deploying Model Endpoints": [
{
@@ -84,7 +85,7 @@ module.exports = {
]
}
]},
{"Launch a Remote IDE": [
{"Launching a Remote IDE": [
'apps/clearml_session',
{type: 'ref', id: 'webapp/applications/apps_ssh_session'},
{type: 'ref', id: 'webapp/applications/apps_jupyter_lab'},