diff --git a/docs/getting_started/building_pipelines.md b/docs/getting_started/building_pipelines.md
new file mode 100644
index 00000000..a6a7466d
--- /dev/null
+++ b/docs/getting_started/building_pipelines.md
@@ -0,0 +1,25 @@
+---
+title: Building Pipelines
+---
+
+
+Pipelines are a way to streamline and connect multiple processes, plugging the output of one process in as the input of another.
+
+ClearML Pipelines are implemented by a Controller Task that holds the logic of the pipeline steps' interactions. The
+execution logic controls which step to launch based on parent steps completing their execution. Depending on the
+specifications laid out in the controller task, a step's parameters can be overridden, enabling users to leverage other
+steps' execution products, such as artifacts and parameters.
+
+When run, the controller sequentially launches the pipeline steps. Pipelines can be executed locally or
+on any machine using the [clearml-agent](../clearml_agent.md).
+
+ClearML pipelines are created from code using one of the following:
+* [PipelineController class](../pipelines/pipelines_sdk_tasks.md) - A pythonic interface for defining and configuring the
+  pipeline controller and its steps. The controller and steps can be functions in your Python code or existing ClearML tasks.
+* [PipelineDecorator class](../pipelines/pipelines_sdk_function_decorators.md) - A set of Python decorators which transform
+  your functions into the pipeline controller and steps.
+
+For more information, see [ClearML Pipelines](../pipelines/pipelines.md).
+
+![Pipeline DAG](../img/webapp_pipeline_DAG.png#light-mode-only)
+![Pipeline DAG](../img/webapp_pipeline_DAG_dark.png#dark-mode-only)
\ No newline at end of file
diff --git a/docs/getting_started/mlops/mlops_second_steps.md b/docs/getting_started/mlops/mlops_second_steps.md
deleted file mode 100644
index aa56772b..00000000
--- a/docs/getting_started/mlops/mlops_second_steps.md
+++ /dev/null
@@ -1,121 +0,0 @@
----
-title: Next Steps
----
-
-Once Tasks are defined and in the ClearML system, they can be chained together to create Pipelines.
-Pipelines provide users with a greater level of abstraction and automation, with Tasks running one after the other.
-
-Tasks can interface with other Tasks in the pipeline and leverage other Tasks' work products.
-
-The sections below describe the following scenarios:
-* [Dataset creation](#dataset-creation)
-* Data [processing](#preprocessing-data) and [consumption](#training)
-* [Pipeline building](#building-the-pipeline)
-
-
-## Building Tasks
-### Dataset Creation
-
-Let's assume you have some code that extracts data from a production database into a local folder.
-Your goal is to create an immutable copy of the data to be used by further steps:
-
-```bash
-clearml-data create --project data --name dataset
-clearml-data sync --folder ./from_production
-```
-
-You can add the tag `latest` to the Dataset, marking it as the latest version.
-
-### Preprocessing Data
-The second step is to preprocess the data: first access the data, then modify it,
-and lastly create a new version of the data.
-
-```python
-from clearml import Task, Dataset
-
-# create a task for the data processing part
-task = Task.init(project_name='data', task_name='create', task_type='data_processing')
-
-# get the v1 dataset
-dataset = Dataset.get(dataset_project='data', dataset_name='dataset_v1')
-
-# get a local mutable copy of the dataset
-dataset_folder = dataset.get_mutable_local_copy(
-    target_folder='work_dataset',
-    overwrite=True
-)
-# change some files in the `./work_dataset` folder
-
-# create a new version of the dataset containing the modified files
-new_dataset = Dataset.create(
-    dataset_project='data',
-    dataset_name='dataset_v2',
-    parent_datasets=[dataset],
-    # this ensures the creation code and the dataset artifacts are stored on the same Task
-    use_current_task=True,
-)
-new_dataset.sync_folder(local_path=dataset_folder)
-new_dataset.upload()
-new_dataset.finalize()
-# move the `latest` tag from the previous version to the new one
-dataset.tags = []
-new_dataset.tags = ['latest']
-```
-
-The new dataset inherits the contents of the datasets specified in `Dataset.create`'s `parent_datasets` argument.
-This not only helps trace dataset changes with full genealogy, but also makes storage more efficient,
-since only the files changed and/or added relative to the parent versions are stored.
-When you access the Dataset, the files from all parent versions are merged transparently,
-as if they were always part of the requested Dataset.
-
-### Training
-You can now train your model with the **latest** Dataset in the system, by getting the Dataset instance
-based on the `latest` tag
-(if you happen to have two Datasets with the same tag, you will get the newest one).
-Once you have the dataset, you can request a local copy of the data. All local copy requests are cached,
-so accessing the same dataset multiple times will not incur unnecessary downloads.
-
-```python
-# create a task for the model training
-task = Task.init(project_name='data', task_name='ingest', task_type='training')
-
-# get the latest dataset with the tag `latest`
-dataset = Dataset.get(dataset_tags='latest')
-
-# get a cached copy of the Dataset files
-dataset_folder = dataset.get_local_copy()
-
-# train our model here
-```
-
-## Building the Pipeline
-
-Now that you have the data creation step and the model training step, create a pipeline that, when executed,
-runs the first step and then the second.
-Remember that pipelines are Tasks themselves, so they can also be automated by other pipelines (i.e. pipelines of pipelines).
-
-```python
-from clearml import PipelineController
-
-pipe = PipelineController(
-    project='data',
-    name='pipeline demo',
-    version='1.0'
-)
-
-pipe.add_step(
-    name='step 1 data',
-    base_task_project='data',
-    base_task_name='create'
-)
-pipe.add_step(
-    name='step 2 train',
-    parents=['step 1 data'],
-    base_task_project='data',
-    base_task_name='ingest'
-)
-```
-
-You can also pass parameters from one step to another (for example, a `Task.id`).
-In addition to pipelines made up of Task steps, ClearML also supports pipelines consisting of function steps. For more
-information, see the [full pipeline documentation](../../pipelines/pipelines.md).
diff --git a/sidebars.js b/sidebars.js
index f56c0101..c14d1b3d 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -62,6 +62,7 @@ module.exports = {
         'getting_started/reproduce_tasks',
         'getting_started/logging_using_artifacts',
         'getting_started/data_management',
+        'getting_started/building_pipelines',
         'hpo',
         {"Deploying Model Endpoints": [
           {
@@ -84,7 +85,7 @@ module.exports = {
       ]
     }
   ]},
-  {"Launch a Remote IDE": [
+  {"Launching a Remote IDE": [
     'apps/clearml_session',
     {type: 'ref', id: 'webapp/applications/apps_ssh_session'},
     {type: 'ref', id: 'webapp/applications/apps_jupyter_lab'},