mirror of
https://github.com/clearml/clearml-docs
synced 2025-06-26 18:17:44 +00:00
Getting Started Refactor part 3
This commit is contained in:
25 docs/getting_started/building_pipelines.md (new file)
@@ -0,0 +1,25 @@
---
title: Building Pipelines
---

Pipelines are a way to streamline and connect multiple processes, plugging the output of one process in as the input of another.

ClearML Pipelines are implemented by a Controller Task that holds the logic of the pipeline steps' interactions. The
execution logic controls which step to launch based on parent steps completing their execution. Depending on the
specifications laid out in the controller task, a step's parameters can be overridden, enabling users to leverage other
steps' execution products such as artifacts and parameters.

When run, the controller sequentially launches the pipeline steps. Pipelines can be executed locally or
on any machine using the [clearml-agent](../clearml_agent.md).

ClearML pipelines are created from code using one of the following:
* [PipelineController class](../pipelines/pipelines_sdk_tasks.md) - A pythonic interface for defining and configuring the
pipeline controller and its steps. The controller and steps can be functions in your Python code or existing ClearML tasks.
* [PipelineDecorator class](../pipelines/pipelines_sdk_function_decorators.md) - A set of Python decorators that transform
your functions into the pipeline controller and steps.

For more information, see [ClearML Pipelines](../pipelines/pipelines.md).
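A minimal sketch of the decorator-based approach described above. The function names, returned values, and project/pipeline names here are illustrative, not part of the original docs:

```python
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=['data'], cache=True)
def load_data():
    # each component runs as its own Task when the pipeline executes
    return list(range(10))

@PipelineDecorator.component(return_values=['total'])
def process(data):
    # receives the previous step's output as its input
    return sum(data)

@PipelineDecorator.pipeline(name='pipeline demo', project='data', version='1.0')
def run_pipeline():
    data = load_data()
    print('total:', process(data))

if __name__ == '__main__':
    # run all steps in the current process (useful for debugging);
    # omit this call to enqueue steps for remote execution by agents
    PipelineDecorator.run_locally()
    run_pipeline()
```

Calling the decorated functions inside the pipeline function defines the execution graph; ClearML infers step dependencies from the data passed between them.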
@@ -1,121 +0,0 @@
---
title: Next Steps
---

Once Tasks are defined and in the ClearML system, they can be chained together to create Pipelines.
Pipelines provide users with a greater level of abstraction and automation, with Tasks running one after the other.

Tasks can interface with other Tasks in the pipeline and leverage other Tasks' work products.

The sections below describe the following scenarios:
* [Dataset creation](#dataset-creation)
* Data [processing](#preprocessing-data) and [consumption](#training)
* [Pipeline building](#building-the-pipeline)

## Building Tasks

### Dataset Creation

Let's assume you have some code that extracts data from a production database into a local folder.
Your goal is to create an immutable copy of the data to be used by further steps:

```bash
clearml-data create --project data --name dataset
clearml-data sync --folder ./from_production
```

You can add a `latest` tag to the Dataset, marking it as the latest version.
### Preprocessing Data

The second step is to preprocess the data: first access the data, then modify it,
and lastly create a new version of the data.

```python
from clearml import Task, Dataset

# create a task for the data processing part
task = Task.init(project_name='data', task_name='create', task_type='data_processing')

# get the v1 dataset
dataset = Dataset.get(dataset_project='data', dataset_name='dataset_v1')

# get a local mutable copy of the dataset
dataset_folder = dataset.get_mutable_local_copy(
    target_folder='work_dataset',
    overwrite=True
)
# change some files in the `./work_dataset` folder

# create a new version of the dataset with the modified files
new_dataset = Dataset.create(
    dataset_project='data',
    dataset_name='dataset_v2',
    parent_datasets=[dataset],
    # this ensures the creation code and the dataset artifacts are on the same Task
    use_current_task=True,
)
new_dataset.sync_folder(local_path=dataset_folder)
new_dataset.upload()
new_dataset.finalize()
# remove the `latest` tag from the previous dataset version
dataset.tags = []
new_dataset.tags = ['latest']
```
The new dataset inherits the contents of the datasets specified in `Dataset.create`'s `parent_datasets` argument.
This not only helps trace back dataset changes with full genealogy, but also makes storage more efficient,
since only the files changed and/or added relative to the parent versions are stored.
When you access the Dataset, the files from all parent versions are merged in a fully automatic and
transparent process, as if they had always been part of the requested Dataset.
### Training

You can now train your model with the **latest** Dataset in the system, by getting the Dataset instance
based on the `latest` tag
(if two Datasets share the same tag, the newest one is returned).
Once you have the dataset, you can request a local copy of the data. All local copy requests are cached,
so accessing the same dataset multiple times will not trigger unnecessary downloads.

```python
from clearml import Task, Dataset

# create a task for the model training
task = Task.init(project_name='data', task_name='ingest', task_type='training')

# get the latest dataset with the tag `latest`
dataset = Dataset.get(dataset_tags='latest')

# get a cached copy of the Dataset files
dataset_folder = dataset.get_local_copy()

# train our model here
```
## Building the Pipeline

Now that you have the dataset creation step and the training step, create a pipeline that, when executed,
runs the first step and then the second.
Remember that pipelines are themselves Tasks, and can also be automated by other pipelines (i.e. pipelines of pipelines).

```python
from clearml import PipelineController

pipe = PipelineController(
    project='data',
    name='pipeline demo',
    version='1.0'
)

pipe.add_step(
    name='step 1 data',
    base_task_project='data',
    base_task_name='create'
)
pipe.add_step(
    name='step 2 train',
    parents=['step 1 data'],
    base_task_project='data',
    base_task_name='ingest'
)
```

You can also pass parameters from one step to another (for example, a `Task.id`).
In addition to pipelines made up of Task steps, ClearML also supports pipelines consisting of function steps. For more
information, see the [full pipeline documentation](../../pipelines/pipelines.md).
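The step-to-step parameter passing mentioned above can be sketched with `add_step`'s `parameter_override` argument. Here, `General/dataset_id` is a hypothetical hyperparameter that the training task is assumed to read; it is not part of the original example:

```python
from clearml import PipelineController

pipe = PipelineController(project='data', name='pipeline demo', version='1.0')

pipe.add_step(
    name='step 1 data',
    base_task_project='data',
    base_task_name='create'
)
pipe.add_step(
    name='step 2 train',
    parents=['step 1 data'],
    base_task_project='data',
    base_task_name='ingest',
    # inject the first step's Task ID into the training task's parameters;
    # `General/dataset_id` is an assumed parameter the training code would read
    parameter_override={'General/dataset_id': '${step 1 data.id}'}
)
```

The `${step_name.field}` syntax is resolved by the controller at runtime, after the referenced step has completed.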
@@ -62,6 +62,7 @@ module.exports = {
      'getting_started/reproduce_tasks',
      'getting_started/logging_using_artifacts',
      'getting_started/data_management',
      'getting_started/building_pipelines',
      'hpo',
      {"Deploying Model Endpoints": [
        {
@@ -84,7 +85,7 @@ module.exports = {
          ]
        }
      ]},
      {"Launch a Remote IDE": [
        {"Launching a Remote IDE": [
          'apps/clearml_session',
          {type: 'ref', id: 'webapp/applications/apps_ssh_session'},
          {type: 'ref', id: 'webapp/applications/apps_jupyter_lab'},