---
title: Workers & Queues
---

Two major components of MLOps are experiment reproducibility and the ability to scale work to multiple machines. ClearML workers, coupled with execution queues, address both of these needs.

A ClearML worker is instantiated by launching a ClearML Agent. The agent is the basis for **Automation** in ClearML and can be leveraged to build automated pipelines, launch custom services (e.g. a [monitor and alert service](https://github.com/allegroai/clearml/tree/master/examples/services/monitoring)), and more.

## What Does a ClearML Agent Do?

The ClearML agent allows users to execute code on any machine it's installed on, facilitating the scaling of data science work beyond one's own machine.

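For example, a worker can be brought up on a machine with a few commands. A minimal sketch (the queue name is illustrative):

```bash
# Install the agent and create its configuration file (credentials, server URLs)
pip install clearml-agent
clearml-agent init

# Start a worker that services the "default" queue
clearml-agent daemon --queue default
```
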
The agent takes care of deploying the code to the target machine as well as setting up the entire execution environment: from installing required packages to setting environment variables, all leading to executing the code (supporting both virtual environment and flexible Docker container configurations).

The agent also supports overriding parameter values on the fly without code modification, enabling no-code experimentation (this is also the foundation on which ClearML [Hyperparameter Optimization](hpo.md) is implemented).

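As a sketch of how this looks in practice, the `clearml-task` CLI can register a script as a task, override an argparse parameter, and enqueue it for an agent to run, with no code changes (the project, script, and parameter names here are illustrative):

```bash
clearml-task --project examples --name param-override \
  --script train.py --args batch_size=64 --queue default
```
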
An agent can be associated with specific GPUs, enabling workload distribution. For example, on a machine with 8 GPUs you can allocate several GPUs to an agent and use the rest for a different workload, even through another agent (see [Dynamic GPU Allocation](../clearml_agent.md#dynamic-gpu-allocation)).

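A minimal sketch of such a split on an 8-GPU machine (queue names are illustrative):

```bash
# Worker A serves "gpu-queue" with GPUs 0-3; GPUs 4-7 stay free for
# other workloads, e.g. a second worker serving a different queue
clearml-agent daemon --detached --queue gpu-queue --gpus 0,1,2,3
clearml-agent daemon --detached --queue other-gpu-queue --gpus 4,5,6,7
```
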
## What is a Queue?

A ClearML queue is an ordered list of Tasks scheduled for execution. One or multiple agents can service a queue. Agents servicing a queue pull the queued tasks in order and execute them.

A ClearML Agent can service multiple queues in either of the following modes (see the launch examples below):

* Strict priority: The agent services the higher-priority queue before servicing lower-priority ones.
* Round robin: The agent pulls a single task from a queue, then moves on to service the next queue.

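The mode is selected when the worker is launched. A minimal sketch (queue names are illustrative):

```bash
# Strict priority: tasks in "high" are always pulled before tasks in "low"
clearml-agent daemon --queue high low

# Round robin: pull one task from each queue in turn
clearml-agent daemon --queue high low --order-fairness
```
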
## Agent and Queue Workflow

![image](../img/clearml_agent_flow_diagram.png)

The diagram above demonstrates a typical flow where an agent executes a task:

1. Enqueue a task for execution in the queue.
1. The agent pulls the task from the queue.
1. The agent launches a docker container in which to run the task's code (see the launch example below).
1. The task's execution environment is set up:
   1. Execute any custom setup script that was configured.
   1. Install any required system packages.
   1. Clone the code from a git repository.
   1. Apply any uncommitted changes recorded.
   1. Set up the Python environment and required packages.
1. The task's script/code is executed.

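Steps 2-5 are driven by how the worker was launched. A minimal sketch of a docker-mode worker (the image name is illustrative, and the task's own container configuration can override it):

```bash
# Each pulled task runs inside a container based on the given default image
clearml-agent daemon --queue default --docker nvidia/cuda:11.8.0-runtime-ubuntu22.04
```
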
While the agent is running, it continuously reports system metrics to the ClearML Server. You can monitor these metrics in the [**Workers and Queues**](../webapp/webapp_workers_queues.md) page.

## Resource Management

Installing an agent on a machine allows it to monitor the machine's status (GPU / CPU / memory / network / disk IO). When managing multiple machines, this gives users an overview of their HW resources: the status of each machine, the expected workload on each machine, and so on.

![Workers and Queues page](../img/agents_queues_resource_management.png)

You can organize your queues according to resource usage. Say you have a single-GPU machine: you can create a queue called `single-gpu-queue` and assign the machine's agent, as well as other single-GPU agents, to that queue. This way you will know that tasks assigned to that queue will be executed by a single-GPU machine.

While the agents are up and running on your machines, you can access these resources from any machine by enqueuing a task to one of your queues, according to the amount of resources you want to allocate to the task.

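A minimal sketch of this setup, run on each single-GPU machine:

```bash
# Serve the shared queue with this machine's only GPU
clearml-agent daemon --queue single-gpu-queue --gpus 0
```
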
With queues and ClearML Agent, you can easily add and remove machines from the cluster, and you can reuse machines without the need for any dedicated containers or images.

## Additional Features

Agents can be deployed bare-metal, with multiple instances allocating specific GPUs to the agents. They can also be deployed as Docker containers in a Kubernetes cluster.

The agent has three running modes (see the launch sketch below):

- Docker Mode: The agent spins up a docker image based on the task's definition. Inside the container, the agent clones the specified repository/code, applies the original execution's uncommitted changes, installs the required Python packages, and starts executing the code while monitoring it.
- Virtual Environment Mode: The agent creates a new virtual environment for the experiment, installs the required Python packages based on the task specification, clones the code repository, applies the uncommitted changes, and finally executes the code while monitoring it.
- Conda Environment Mode: Similar to Virtual Environment Mode, except that it installs packages using a combination of conda and pip. Note that this mode is quite brittle due to conda package version compatibility.

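A minimal sketch of launching workers in the first two modes (the image name is illustrative); conda mode is typically selected through the agent's configuration file rather than a command-line flag, e.g. `agent.package_manager.type: conda` in `clearml.conf`:

```bash
# Docker mode: each task runs inside a container based on the given image
clearml-agent daemon --queue default --docker python:3.10

# Virtual environment mode: the default when --docker is not passed
clearml-agent daemon --queue default
```
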
## Services Mode

In its default mode, a ClearML Agent executes a single task at a time, since training tasks typically require all the resources available to them. Some tasks, however, are mostly idle and require little computation power, such as controller tasks (e.g. a pipeline controller) or service tasks (e.g. a cleanup service).

This is where `services-mode` comes into play. An agent running in `services-mode` will let multiple tasks execute in parallel (each task registers itself as a sub-agent, visible in the [Workers & Queues](../webapp/webapp_workers_queues.md) tab in the UI).

This mode is intended for running maintenance tasks. Some suitable tasks include:

- [Pipeline controller](../guides/pipeline/pipeline_controller.md) - Implementing the pipeline scheduling and logic
- [Hyper-Parameter Optimization](../guides/optimization/hyper-parameter-optimization/examples_hyperparam_opt.md) - Implementing an active selection of experiments
- [Control Service](../guides/services/aws_autoscaler.md) - For example, the AWS Autoscaler
- [External services](../guides/services/slack_alerts.md) - Such as the Slack integration alert service

:::warning
Do not enqueue training or inference tasks into the services queue. They will put an unnecessary load on the server.
:::

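A minimal sketch of launching a services-mode worker (the concurrency limit and image are illustrative):

```bash
# Run up to 10 concurrent service tasks from the "services" queue,
# each inside a lightweight CPU-only container
clearml-agent daemon --services-mode 10 --queue services --docker ubuntu:22.04 --cpu-only
```
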
By default, the open source [ClearML Server](../deploying_clearml/clearml_server.md) runs a single clearml-agent in services mode that listens to the `services` queue.