Clarify pipeline step caching (#474)

pollfly 2023-02-16 20:57:02 +02:00 committed by GitHub
parent 2cf096f7ec
commit bc06f92614
3 changed files with 19 additions and 8 deletions


@@ -57,9 +57,15 @@ pipeline Task. To enable the automatic logging, use the `monitor_metrics`, `moni
 when creating a pipeline step.
 
 ### Pipeline Step Caching
-The Pipeline controller also offers step caching, meaning, reusing outputs of previously executed pipeline steps, in the
-case of exact same step code, and the same step input values. By default, pipeline steps are not cached. Enable caching
-when creating a pipeline step.
+The Pipeline controller supports step caching, meaning, reusing outputs of previously executed pipeline steps.
+
+Cached pipeline steps are reused when they meet the following criteria:
+* The step code is the same, including environment setup (components in the task's [Execution](../webapp/webapp_exp_track_visual.md#execution)
+  section, like required packages and docker image)
+* The step input arguments are unchanged, including step arguments and parameters (anything logged to the task's [Configuration](../webapp/webapp_exp_track_visual.md#configuration)
+  section)
+
+By default, pipeline steps are not cached. Enable caching when creating a pipeline step (for example, see [@PipelineDecorator.component](pipelines_sdk_function_decorators.md#pipelinedecoratorcomponent)).
 
 When a step is cached, the step code is hashed, alongside the step's parameters (as passed in runtime), into a single
 representing hash string. The pipeline first checks if a cached step exists in the system (archived Tasks will not be used
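
To make the caching criteria above concrete, here is a minimal sketch of a cached step using the decorator interface; the step name, body, and argument are illustrative and not part of this commit:

```python
from clearml import PipelineDecorator

# `cache=True` enables step caching: before running, the controller hashes the
# step's code (including its environment setup) together with its input values,
# and reuses the stored outputs of a previous run when the hashes match.
@PipelineDecorator.component(return_values=["processed"], cache=True)
def preprocess(raw_value: int):
    processed = raw_value * 2  # same code + same `raw_value` -> cached output reused
    return processed
```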


@@ -88,8 +88,9 @@ def step_one(pickle_data_url: str, extra: int = 43):
 * `return_values` - The artifact names for the step's corresponding ClearML task to store the step's returned objects.
   In the example above, a single object is returned and stored as an artifact named `data_frame`
 * `name` (Optional) - The name for the pipeline step. If not provided, the function name is used
-* `cache` - If `True`, the pipeline controller checks if an identical step with the same parameters was already executed.
-  If found, its outputs are used instead of rerunning the step.
+* `cache` - If `True`, the pipeline controller checks if a step with the same code (including setup, see task [Execution](../webapp/webapp_exp_track_visual.md#execution)
+  section) and input arguments was already executed. If found, the cached step's outputs are used
+  instead of rerunning the step.
 * `packages` - A list of required packages or a local requirements.txt file. Example: `["tqdm>=2.1", "scikit-learn"]` or
   `"./requirements.txt"`. If not provided, packages are automatically added based on the imports used inside the function.
 * `execution_queue` (Optional) - Queue in which to enqueue the specific step. This overrides the queue set with the
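
Taken together, a hypothetical `@PipelineDecorator.component` call exercising the parameters above might look like this (the queue name and package pin are assumptions for illustration):

```python
from clearml import PipelineDecorator

@PipelineDecorator.component(
    return_values=["data_frame"],  # artifact name for the returned object
    name="step_one",               # defaults to the function name if omitted
    cache=True,                    # reuse outputs when code and inputs are unchanged
    packages=["pandas>=1.0"],      # or a path such as "./requirements.txt"
    execution_queue="default",     # assumed queue name; overrides the pipeline-level queue
)
def step_one(pickle_data_url: str, extra: int = 43):
    ...
```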


@@ -75,7 +75,9 @@ pipe.add_step(
 * One of the following:
   * `base_task_project` and `base_task_name` - Project and name of the base task to clone
   * `base_task_id` - ID of the base task to clone
-* `cache_executed_step` If `True`, the controller will check if an identical task with the same parameters was already executed. If it was found, its outputs will be used instead of launching a new task.
+* `cache_executed_step` - If `True`, the controller will check if an identical task with the same code (including setup,
+  e.g. required packages, docker image, etc.) and input arguments was already executed. If found, the cached step's
+  outputs are used instead of launching a new task.
 * `execution_queue` (Optional) - The queue to use for executing this specific step. If not provided, the task will be sent to the default execution queue, as defined on the class
 * `parents` - Optional list of parent steps in the pipeline. The current step in the pipeline will be sent for execution only after all the parent steps have been executed successfully.
 * `parameter_override` - Dictionary of parameters and values to override in the current step. See [parameter_override](#parameter_override).
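
For illustration, a minimal sketch of `add_step` combining these arguments; the controller settings, project, base task name, and override key are hypothetical:

```python
from clearml import PipelineController

pipe = PipelineController(name="pipeline demo", project="examples", version="1.0.0")

pipe.add_step(
    name="stage_data",
    base_task_project="examples",              # assumed project of the base task
    base_task_name="pipeline step 1 dataset",  # assumed base task to clone
    cache_executed_step=True,                  # reuse an identical prior execution
    parameter_override={"General/dataset_url": "${pipeline.url}"},
)
```
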
@@ -141,8 +143,10 @@ pipe.add_function_step(
 * `function_kwargs` (optional) - A dictionary of function arguments and default values which are translated into task
   hyperparameters. If not provided, all function arguments are translated into hyperparameters.
 * `function_return` - The names for storing the pipeline step's returned objects as artifacts in its ClearML task.
-* `cache_executed_step` - If `True`, the controller checks if an identical task with the same parameters was already
-  executed. If it was found, its outputs are used instead of launching a new task.
+* `cache_executed_step` - If `True`, the controller will check if an identical task with the same code
+  (including setup, see task [Execution](../webapp/webapp_exp_track_visual.md#execution)
+  section) and input arguments was already executed. If found, the cached step's
+  outputs are used instead of launching a new task.
 * `parents` - Optional list of parent steps in the pipeline. The current step in the pipeline will be sent for execution
   only after all the parent steps have been executed successfully.
 * `pre_execute_callback` & `post_execute_callback` - Control pipeline flow with callback functions that can be called
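
As a sketch, a function step wired to the `pipe` controller from the previous example; the function body, artifact reference, and parent name are assumptions:

```python
def process_data(data_frame, threshold: float = 0.5):
    ...  # hypothetical step body

pipe.add_function_step(
    name="stage_process",
    function=process_data,
    # "${stage_data.data_frame}" assumes the parent step stored an artifact named `data_frame`
    function_kwargs=dict(data_frame="${stage_data.data_frame}", threshold=0.5),
    function_return=["processed_data"],
    cache_executed_step=True,  # reuse outputs when code and inputs are unchanged
    parents=["stage_data"],    # run only after stage_data completes successfully
)
```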