Clarify pipeline step caching (#474)

pollfly 2023-02-16 20:57:02 +02:00 committed by GitHub
parent 2cf096f7ec
commit bc06f92614
3 changed files with 19 additions and 8 deletions


@@ -57,9 +57,15 @@ pipeline Task. To enable the automatic logging, use the `monitor_metrics`, `moni
 when creating a pipeline step.
 
 ### Pipeline Step Caching
-The Pipeline controller also offers step caching, meaning, reusing outputs of previously executed pipeline steps, in the
-case of exact same step code, and the same step input values. By default, pipeline steps are not cached. Enable caching
-when creating a pipeline step.
+The Pipeline controller supports step caching, meaning, reusing outputs of previously executed pipeline steps.
+
+Cached pipeline steps are reused when they meet the following criteria:
+* The step code is the same, including environment setup (components in the task's [Execution](../webapp/webapp_exp_track_visual.md#execution)
+  section, like required packages and docker image)
+* The step input arguments are unchanged, including step arguments and parameters (anything logged to the task's [Configuration](../webapp/webapp_exp_track_visual.md#configuration)
+  section)
+
+By default, pipeline steps are not cached. Enable caching when creating a pipeline step (for example, see [@PipelineDecorator.component](pipelines_sdk_function_decorators.md#pipelinedecoratorcomponent)).
 
 When a step is cached, the step code is hashed, alongside the step's parameters (as passed in runtime), into a single
 representing hash string. The pipeline first checks if a cached step exists in the system (archived Tasks will not be used
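
To make the caching criteria above concrete, here is a minimal sketch of a cached step using the decorator interface; the step name, body, and argument are illustrative and not part of this commit:

```python
from clearml import PipelineDecorator

# `cache=True` enables step caching: before running, the controller hashes the
# step's code (including its environment setup) together with its input values,
# and reuses the stored outputs of a previous run when the hashes match.
@PipelineDecorator.component(return_values=["processed"], cache=True)
def preprocess(raw_value: int):
    processed = raw_value * 2  # same code + same `raw_value` -> cached output reused
    return processed
```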


@@ -88,8 +88,9 @@ def step_one(pickle_data_url: str, extra: int = 43):
 * `return_values` - The artifact names for the step's corresponding ClearML task to store the step's returned objects.
   In the example above, a single object is returned and stored as an artifact named `data_frame`
 * `name` (Optional) - The name for the pipeline step. If not provided, the function name is used
-* `cache` - If `True`, the pipeline controller checks if an identical step with the same parameters was already executed.
-  If found, its outputs are used instead of rerunning the step.
+* `cache` - If `True`, the pipeline controller checks if a step with the same code (including setup, see task [Execution](../webapp/webapp_exp_track_visual.md#execution)
+  section) and input arguments was already executed. If found, the cached step's outputs are used
+  instead of rerunning the step.
 * `packages` - A list of required packages or a local requirements.txt file. Example: `["tqdm>=2.1", "scikit-learn"]` or
   `"./requirements.txt"`. If not provided, packages are automatically added based on the imports used inside the function.
 * `execution_queue` (Optional) - Queue in which to enqueue the specific step. This overrides the queue set with the
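
Taken together, a hypothetical `@PipelineDecorator.component` call exercising the parameters above might look like this (the queue name and package pin are assumptions for illustration):

```python
from clearml import PipelineDecorator

@PipelineDecorator.component(
    return_values=["data_frame"],  # artifact name for the returned object
    name="step_one",               # defaults to the function name if omitted
    cache=True,                    # reuse outputs when code and inputs are unchanged
    packages=["pandas>=1.0"],      # or a path such as "./requirements.txt"
    execution_queue="default",     # assumed queue name; overrides the pipeline-level queue
)
def step_one(pickle_data_url: str, extra: int = 43):
    ...
```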


@@ -75,7 +75,9 @@ pipe.add_step(
 * One of the following:
   * `base_task_project` and `base_task_name` - Project and name of the base task to clone
   * `base_task_id` - ID of the base task to clone
-* `cache_executed_step` If `True`, the controller will check if an identical task with the same parameters was already executed. If it was found, its outputs will be used instead of launching a new task.
+* `cache_executed_step` - If `True`, the controller will check if an identical task with the same code (including setup,
+  e.g. required packages, docker image, etc.) and input arguments was already executed. If found, the cached step's
+  outputs are used instead of launching a new task.
 * `execution_queue` (Optional) - The queue to use for executing this specific step. If not provided, the task will be sent to the default execution queue, as defined on the class
 * `parents` - Optional list of parent steps in the pipeline. The current step in the pipeline will be sent for execution only after all the parent steps have been executed successfully.
 * `parameter_override` - Dictionary of parameters and values to override in the current step. See [parameter_override](#parameter_override).
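
For illustration, a minimal sketch of `add_step` combining these arguments; the controller settings, project, base task name, and override key are hypothetical:

```python
from clearml import PipelineController

pipe = PipelineController(name="pipeline demo", project="examples", version="1.0.0")

pipe.add_step(
    name="stage_data",
    base_task_project="examples",              # assumed project of the base task
    base_task_name="pipeline step 1 dataset",  # assumed base task to clone
    cache_executed_step=True,                  # reuse an identical prior execution
    parameter_override={"General/dataset_url": "${pipeline.url}"},
)
```
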
@@ -141,8 +143,10 @@ pipe.add_function_step(
 * `function_kwargs` (optional) - A dictionary of function arguments and default values which are translated into task
   hyperparameters. If not provided, all function arguments are translated into hyperparameters.
 * `function_return` - The names for storing the pipeline step's returned objects as artifacts in its ClearML task.
-* `cache_executed_step` - If `True`, the controller checks if an identical task with the same parameters was already
-  executed. If it was found, its outputs are used instead of launching a new task.
+* `cache_executed_step` - If `True`, the controller will check if an identical task with the same code
+  (including setup, see task [Execution](../webapp/webapp_exp_track_visual.md#execution)
+  section) and input arguments was already executed. If found, the cached step's
+  outputs are used instead of launching a new task.
 * `parents` - Optional list of parent steps in the pipeline. The current step in the pipeline will be sent for execution
   only after all the parent steps have been executed successfully.
 * `pre_execute_callback` & `post_execute_callback` - Control pipeline flow with callback functions that can be called
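
As a sketch, a function step wired to the `pipe` controller from the previous example; the function body, artifact reference, and parent name are assumptions:

```python
def process_data(data_frame, threshold: float = 0.5):
    ...  # hypothetical step body

pipe.add_function_step(
    name="stage_process",
    function=process_data,
    # "${stage_data.data_frame}" assumes the parent step stored an artifact named `data_frame`
    function_kwargs=dict(data_frame="${stage_data.data_frame}", threshold=0.5),
    function_return=["processed_data"],
    cache_executed_step=True,  # reuse outputs when code and inputs are unchanged
    parents=["stage_data"],    # run only after stage_data completes successfully
)
```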