From bc06f92614539088e0982d453bf2a10b648ed7dd Mon Sep 17 00:00:00 2001
From: pollfly <75068813+pollfly@users.noreply.github.com>
Date: Thu, 16 Feb 2023 20:57:02 +0200
Subject: [PATCH] Clarify pipeline step caching (#474)

---
 docs/pipelines/pipelines.md | 12 +++++++++---
 docs/pipelines/pipelines_sdk_function_decorators.md | 5 +++--
 docs/pipelines/pipelines_sdk_tasks.md | 10 +++++++---
 3 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/docs/pipelines/pipelines.md b/docs/pipelines/pipelines.md
index 0dc42775..a57f3d2a 100644
--- a/docs/pipelines/pipelines.md
+++ b/docs/pipelines/pipelines.md
@@ -57,9 +57,15 @@ pipeline Task. To enable the automatic logging, use the `monitor_metrics`, `moni
 when creating a pipeline step.
 
 ### Pipeline Step Caching
-The Pipeline controller also offers step caching, meaning, reusing outputs of previously executed pipeline steps, in the
-case of exact same step code, and the same step input values. By default, pipeline steps are not cached. Enable caching
-when creating a pipeline step.
+The Pipeline controller supports step caching, meaning, reusing outputs of previously executed pipeline steps.
+
+Cached pipeline steps are reused when they meet the following criteria:
+* The step code is the same, including environment setup (components in the task's [Execution](../webapp/webapp_exp_track_visual.md#execution)
+section, like required packages and docker image)
+* The step input arguments are unchanged, including step arguments and parameters (anything logged to the task's [Configuration](../webapp/webapp_exp_track_visual.md#configuration)
+section)
+
+By default, pipeline steps are not cached. Enable caching when creating a pipeline step (for example, see [@PipelineDecorator.component](pipelines_sdk_function_decorators.md#pipelinedecoratorcomponent)).
 
 When a step is cached, the step code is hashed, alongside the step’s parameters (as passed in runtime), into a single
 representing hash string. The pipeline first checks if a cached step exists in the system (archived Tasks will not be used
diff --git a/docs/pipelines/pipelines_sdk_function_decorators.md b/docs/pipelines/pipelines_sdk_function_decorators.md
index db01537a..0a52e0b8 100644
--- a/docs/pipelines/pipelines_sdk_function_decorators.md
+++ b/docs/pipelines/pipelines_sdk_function_decorators.md
@@ -88,8 +88,9 @@ def step_one(pickle_data_url: str, extra: int = 43):
 * `return_values` - The artifact names for the step’s corresponding ClearML task to store the step’s returned objects.
   In the example above, a single object is returned and stored as an artifact named `data_frame`
 * `name` (Optional) - The name for the pipeline step. If not provided, the function name is used
-* `cache` - If `True`, the pipeline controller checks if an identical step with the same parameters was already executed.
-  If found, its outputs are used instead of rerunning the step.
+* `cache` - If `True`, the pipeline controller checks if a step with the same code (including setup, see task [Execution](../webapp/webapp_exp_track_visual.md#execution)
+  section) and input arguments was already executed. If found, the cached step's outputs are used
+  instead of rerunning the step.
 * `packages` - A list of required packages or a local requirements.txt file. Example: `["tqdm>=2.1", "scikit-learn"]` or `"./requirements.txt"`.
   If not provided, packages are automatically added based on the imports used inside the function.
 * `execution_queue` (Optional) - Queue in which to enqueue the specific step. This overrides the queue set with the
diff --git a/docs/pipelines/pipelines_sdk_tasks.md b/docs/pipelines/pipelines_sdk_tasks.md
index 37544545..a8f9d9e7 100644
--- a/docs/pipelines/pipelines_sdk_tasks.md
+++ b/docs/pipelines/pipelines_sdk_tasks.md
@@ -75,7 +75,9 @@ pipe.add_step(
 * One of the following:
   * `base_task_project` and `base_task_name` - Project and name of the base task to clone
   * `base_task_id` - ID of the base task to clone
-* `cache_executed_step` – If `True`, the controller will check if an identical task with the same parameters was already executed. If it was found, its outputs will be used instead of launching a new task.
+* `cache_executed_step` – If `True`, the controller will check if an identical task with the same code (including setup,
+  e.g. required packages, docker image, etc.) and input arguments was already executed. If found, the cached step's
+  outputs are used instead of launching a new task.
 * `execution_queue` (Optional) - the queue to use for executing this specific step. If not provided, the task will be sent to the default execution queue, as defined on the class
 * `parents` – Optional list of parent steps in the pipeline. The current step in the pipeline will be sent for execution only after all the parent steps have been executed successfully.
 * `parameter_override` - Dictionary of parameters and values to override in the current step. See [parameter_override](#parameter_override).
@@ -141,8 +143,10 @@ pipe.add_function_step(
 * `function_kwargs` (optional) - A dictionary of function arguments and default values which are translated into task
   hyperparameters. If not provided, all function arguments are translated into hyperparameters.
 * `function_return` - The names for storing the pipeline step’s returned objects as artifacts in its ClearML task.
-* `cache_executed_step` - If `True`, the controller checks if an identical task with the same parameters was already
-  executed. If it was found, its outputs are used instead of launching a new task.
+* `cache_executed_step` - If `True`, the controller will check if an identical task with the same code
+  (including setup, see task [Execution](../webapp/webapp_exp_track_visual.md#execution)
+  section) and input arguments was already executed. If found, the cached step's
+  outputs are used instead of launching a new task.
 * `parents` – Optional list of parent steps in the pipeline. The current step in the pipeline will be sent for execution only after all the parent steps have been executed successfully.
 * `pre_execute_callback` & `post_execute_callback` - Control pipeline flow with callback functions that can be called