mirror of
https://github.com/clearml/clearml-docs
synced 2025-06-26 18:17:44 +00:00
Getting Started Refactor part 3
This commit is contained in:
25 docs/getting_started/building_pipelines.md (new file)
@@ -0,0 +1,25 @@
---
title: Building Pipelines
---

Pipelines are a way to streamline and connect multiple processes, plugging the output of one process in as the input of another.

ClearML Pipelines are implemented by a Controller Task that holds the logic of the pipeline steps' interactions. The
execution logic controls which step to launch based on parent steps completing their execution. Depending on the
specifications laid out in the controller task, a step's parameters can be overridden, enabling users to leverage other
steps' execution products such as artifacts and parameters.

When run, the controller sequentially launches the pipeline steps. Pipelines can be executed locally or
on any machine using the [clearml-agent](../clearml_agent.md).

ClearML pipelines are created from code using one of the following:
* [PipelineController class](../pipelines/pipelines_sdk_tasks.md) - A pythonic interface for defining and configuring the
pipeline controller and its steps. The controller and steps can be functions in your Python code or existing ClearML tasks.
* [PipelineDecorator class](../pipelines/pipelines_sdk_function_decorators.md) - A set of Python decorators that transform
your functions into the pipeline controller and steps.

For more information, see [ClearML Pipelines](../pipelines/pipelines.md).
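A minimal sketch of the decorator-based approach described above. The function names, returned values, and project/pipeline names here are illustrative, not part of the original docs:

```python
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=['data'], cache=True)
def load_data():
    # each component runs as its own Task when the pipeline executes
    return list(range(10))

@PipelineDecorator.component(return_values=['total'])
def process(data):
    # receives the previous step's output as its input
    return sum(data)

@PipelineDecorator.pipeline(name='pipeline demo', project='data', version='1.0')
def run_pipeline():
    data = load_data()
    print('total:', process(data))

if __name__ == '__main__':
    # run all steps in the current process (useful for debugging);
    # omit this call to enqueue steps for remote execution by agents
    PipelineDecorator.run_locally()
    run_pipeline()
```

Calling the decorated functions inside the pipeline function defines the execution graph; ClearML infers step dependencies from the data passed between them.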
@@ -1,121 +0,0 @@
---
title: Next Steps
---

Once Tasks are defined and in the ClearML system, they can be chained together to create Pipelines.
Pipelines provide users with a greater level of abstraction and automation, with Tasks running one after the other.

Tasks can interface with other Tasks in the pipeline and leverage other Tasks' work products.

The sections below describe the following scenarios:
* [Dataset creation](#dataset-creation)
* Data [processing](#preprocessing-data) and [consumption](#training)
* [Pipeline building](#building-the-pipeline)

## Building Tasks

### Dataset Creation

Let's assume you have some code that extracts data from a production database into a local folder.
Your goal is to create an immutable copy of the data to be used by further steps:

```bash
clearml-data create --project data --name dataset
clearml-data sync --folder ./from_production
```

You can add a `latest` tag to the Dataset, marking it as the latest version.
### Preprocessing Data

The second step is to preprocess the data: first access the data, then modify it,
and lastly create a new version of the data.

```python
from clearml import Task, Dataset

# create a task for the data processing part
task = Task.init(project_name='data', task_name='create', task_type='data_processing')

# get the v1 dataset
dataset = Dataset.get(dataset_project='data', dataset_name='dataset_v1')

# get a local mutable copy of the dataset
dataset_folder = dataset.get_mutable_local_copy(
    target_folder='work_dataset',
    overwrite=True
)
# change some files in the `./work_dataset` folder

# create a new version of the dataset with the modified files
new_dataset = Dataset.create(
    dataset_project='data',
    dataset_name='dataset_v2',
    parent_datasets=[dataset],
    # this ensures the creation code and the dataset artifacts are on the same Task
    use_current_task=True,
)
new_dataset.sync_folder(local_path=dataset_folder)
new_dataset.upload()
new_dataset.finalize()
# remove the `latest` tag from the previous dataset version
dataset.tags = []
new_dataset.tags = ['latest']
```
The new dataset inherits the contents of the datasets specified in `Dataset.create`'s `parent_datasets` argument.
This not only helps trace back dataset changes with full genealogy, but also makes storage more efficient,
since only the files changed and/or added relative to the parent versions are stored.
When you access the Dataset, the files from all parent versions are merged in a fully automatic and
transparent process, as if they had always been part of the requested Dataset.
### Training

You can now train your model with the **latest** Dataset in the system, by getting the Dataset instance
based on the `latest` tag
(if two Datasets share the same tag, the newest one is returned).
Once you have the dataset, you can request a local copy of the data. All local copy requests are cached,
so accessing the same dataset multiple times will not trigger unnecessary downloads.

```python
from clearml import Task, Dataset

# create a task for the model training
task = Task.init(project_name='data', task_name='ingest', task_type='training')

# get the latest dataset with the tag `latest`
dataset = Dataset.get(dataset_tags='latest')

# get a cached copy of the Dataset files
dataset_folder = dataset.get_local_copy()

# train our model here
```
## Building the Pipeline

Now that you have the dataset creation step and the training step, create a pipeline that, when executed,
runs the first step and then the second.
Remember that pipelines are themselves Tasks, and can also be automated by other pipelines (i.e. pipelines of pipelines).

```python
from clearml import PipelineController

pipe = PipelineController(
    project='data',
    name='pipeline demo',
    version='1.0'
)

pipe.add_step(
    name='step 1 data',
    base_task_project='data',
    base_task_name='create'
)
pipe.add_step(
    name='step 2 train',
    parents=['step 1 data'],
    base_task_project='data',
    base_task_name='ingest'
)
```

You can also pass parameters from one step to another (for example, a `Task.id`).
In addition to pipelines made up of Task steps, ClearML also supports pipelines consisting of function steps. For more
information, see the [full pipeline documentation](../../pipelines/pipelines.md).
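The step-to-step parameter passing mentioned above can be sketched with `add_step`'s `parameter_override` argument. Here, `General/dataset_id` is a hypothetical hyperparameter that the training task is assumed to read; it is not part of the original example:

```python
from clearml import PipelineController

pipe = PipelineController(project='data', name='pipeline demo', version='1.0')

pipe.add_step(
    name='step 1 data',
    base_task_project='data',
    base_task_name='create'
)
pipe.add_step(
    name='step 2 train',
    parents=['step 1 data'],
    base_task_project='data',
    base_task_name='ingest',
    # inject the first step's Task ID into the training task's parameters;
    # `General/dataset_id` is an assumed parameter the training code would read
    parameter_override={'General/dataset_id': '${step 1 data.id}'}
)
```

The `${step_name.field}` syntax is resolved by the controller at runtime, after the referenced step has completed.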
@@ -62,6 +62,7 @@ module.exports = {
      'getting_started/reproduce_tasks',
      'getting_started/logging_using_artifacts',
      'getting_started/data_management',
      'getting_started/building_pipelines',
      'hpo',
      {"Deploying Model Endpoints": [
        {
@@ -84,7 +85,7 @@ module.exports = {
          ]
        }
      ]},
      {"Launch a Remote IDE": [
        {"Launching a Remote IDE": [
          'apps/clearml_session',
          {type: 'ref', id: 'webapp/applications/apps_ssh_session'},
          {type: 'ref', id: 'webapp/applications/apps_jupyter_lab'},