---
title: Orchestration
---
With the **Orchestration** page, you can:
* Use cloud autoscaling apps to define your compute resource budget, and have the apps automatically manage your resource
  consumption as needed, with no code (available under the ClearML Pro plan)
* Monitor resources (CPU and GPU, memory, video memory, and network usage) used by the experiments / Tasks that workers
  execute
* View workers and the queues they listen to
* Manage worker queues
  * Create and rename queues
  * Delete empty queues
  * Monitor queue utilization
  * Reorder, move, and remove experiments from queues

## Autoscalers
:::info Pro Plan Offering
The ClearML Autoscaler apps are available under the ClearML Pro plan.
:::
Use the **AUTOSCALERS** tab to access ClearML's cloud autoscaling applications:
* GPU Compute (powered by Genesis Cloud)
* AWS Autoscaler
* GCP Autoscaler

The autoscalers automatically spin cloud instances up or down as needed, within a budget that you set, so you
pay only for the time that you actually use the machines.
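
To make the budget idea concrete, here is a minimal sketch of a scale up/down decision. It is a simplified illustration only, not the logic the ClearML autoscaler apps actually run. The function name `plan_scaling` and its parameters are hypothetical, and the model assumes one task per instance with the budget expressed as a maximum instance count.

```python
def plan_scaling(pending_tasks: int, running_instances: int,
                 idle_instances: int, max_instances: int) -> int:
    """Return how many instances to add (positive) or remove (negative).

    Hypothetical model: one task per instance, `max_instances` is the budget cap.
    """
    if min(pending_tasks, running_instances, idle_instances, max_instances) < 0:
        raise ValueError("all arguments must be non-negative")
    if idle_instances > running_instances:
        raise ValueError("idle_instances cannot exceed running_instances")

    # Idle instances can absorb pending tasks before anything new is started.
    unserved = max(0, pending_tasks - idle_instances)
    if unserved > 0:
        # Scale up, but never beyond the budget cap.
        headroom = max(0, max_instances - running_instances)
        return min(unserved, headroom)

    # No unserved work: release the idle instances that no pending task needs.
    surplus = idle_instances - pending_tasks
    return -surplus
```

With five pending tasks, two busy instances, and a cap of four, the sketch asks for two more instances; with nothing pending and two idle instances, it releases both.
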

The **AWS** and **GCP** autoscaler applications manage instances on your behalf in your cloud account. When
launching an app instance, you provide your cloud service credentials so the autoscaler can access your account.

The **GPU Compute** application provides on-demand GPU instances powered by Genesis Cloud. All you need to do is define your
compute resource budget, and you're good to go.

Once you launch an autoscaler app instance, you can monitor the autoscaler's activity and your cloud usage in the instance's
dashboard.

For more information about how autoscalers work, see the [Cloud Autoscaling Overview](../cloud_autoscaling/autoscaling_overview.md).
For more information about a specific autoscaler, see [GPU Compute](applications/apps_gpu_compute.md), [AWS Autoscaler](applications/apps_aws_autoscaler.md),
or [GCP Autoscaler](applications/apps_gcp_autoscaler.md).

![Cloud autoscalers](../img/webapp_orchestration_autoscalers.png)
## Workers
Use the **WORKERS** tab to track worker activity and monitor worker utilization.
The page shows a worker activity graph and a worker details table. The graph time span can be controlled through the menu
at its top right corner. Hover over any plot point to see its data. By default, the **WORKER UTILIZATION** graph displays the
number of active and total workers over time.
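
As a rough illustration of what the default graph plots at each point in time, the sketch below counts active and total workers from a list of worker records. The `Worker` record and its fields are hypothetical stand-ins, not ClearML data structures, and the sketch assumes a worker counts as active when it is currently executing an experiment.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Worker:
    name: str
    current_experiment: Optional[str] = None  # None means the worker is idle


def utilization(workers: Iterable[Worker]) -> Tuple[int, int]:
    """Return (active, total) worker counts for a single point in time."""
    workers = list(workers)
    active = sum(1 for w in workers if w.current_experiment is not None)
    return active, len(workers)
```
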

The worker table shows the currently available workers and their current execution information:
* Currently running experiment
* Current execution time
* Training iteration

Clicking on a worker will open the worker's details panel and replace the graph with that worker's resource utilization
information. The resource metric being monitored can be selected through the menu at the graph's top left corner:
* CPU and GPU Usage
* Memory Usage
* Video Memory Usage
* Network Usage

The worker's details panel includes the following two tabs:
* **INFO** - Worker information:
  * Worker Name
  * Update time - The last time the worker reported data
  * Current Experiment - The experiment currently being executed by the worker
  * Experiment Runtime - How long the currently executing experiment has been running
  * Experiment iteration - The last reported training iteration for the experiment
* **QUEUES** - Information about the queues that the worker is assigned to:
  * Queue - The name of the queue
  * Next experiment - The next experiment available in this queue
  * In Queue - The number of experiments currently enqueued
![Worker management](../img/agents_queues_resource_management.png)
## Queues
Use the **QUEUES** tab to manage queues and monitor their statistics. The page shows graphs of the average experiment
wait time and the number of queued experiments, and a queue details table. Hover over any plot point to view its data.
By default, the graphs display the overall information of all queues.
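
The two plotted statistics can be illustrated with a short sketch. The page does not specify how ClearML computes the average wait time, so the definition below is an assumption made for illustration: each experiment's wait runs from the moment it was enqueued until it started, and experiments that are still waiting are counted up to the current time. The function name `queue_stats` and the entry layout are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def queue_stats(entries: List[Tuple[datetime, Optional[datetime]]],
                now: datetime) -> Tuple[timedelta, int]:
    """Return (average wait, number of experiments still queued).

    Each entry is (enqueued_at, started_at); started_at is None while the
    experiment is still waiting. Waiting experiments are measured up to `now`.
    """
    if not entries:
        return timedelta(0), 0
    waits = [(started or now) - enqueued for enqueued, started in entries]
    queued = sum(1 for _, started in entries if started is None)
    return sum(waits, timedelta(0)) / len(waits), queued
```

For example, one experiment that waited 10 minutes before starting plus one that has been waiting for 30 minutes gives an average wait of 20 minutes and one queued experiment.
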
The queue table shows the following queue information:
* Queue - Queue name
* Workers - Number of workers servicing the queue
* Next Experiment - The next experiment available in this queue
* Last Updated - The last time queue contents were modified
* In Queue - Number of experiments currently enqueued in the queue
To create a new queue, click **+ NEW QUEUE** (top left).
Hover over a queue and click <img src="/docs/latest/icons/ico-copy-to-clipboard.svg" alt="Copy" className="icon size-md space-sm" />
to copy the queue's ID.
![image](../img/4100.png)
Right-click on a queue or hover and click its action button <img src="/docs/latest/icons/ico-dots-v-menu.svg" alt="Dot menu" className="icon size-md space-sm" />
to access queue actions:
![Queue context menu](../img/webapp_workers_queues_context.png)
* Delete - Delete the queue. Any pending tasks will be dequeued.
* Rename - Change the queue's name
* Clear - Remove all pending tasks from the queue
* Custom action - The ClearML Enterprise Server provides a mechanism to define your own custom actions, which will
  appear in the context menu. See [Custom UI Context Menu Actions](../deploying_clearml/clearml_server_config.md#custom-ui-context-menu-actions).

Clicking on a queue will open the queue's details panel and replace the graphs with that queue's statistics.
The queue's details panel includes the following two tabs:
* **EXPERIMENTS** - A list of experiments in the queue. You can reorder and remove enqueued experiments. See
  [Controlling Queue Contents](#controlling-queue-contents).
* **WORKERS** - Information about the workers assigned to the queue:
  * Name - Worker name
  * IP - Worker's IP
  * Currently Executing - The experiment currently being executed by the worker
### Controlling Queue Contents
Click on an experiment's menu button <img src="/docs/latest/icons/ico-dots-v-menu.svg" alt="Dot menu" className="icon size-md space-sm" />
in the **EXPERIMENTS** tab to reorganize your queue:
![Queue experiment's menu](../img/workers_queues_experiment_actions.png)
* Move a task to the top or bottom of the queue
* Move the task to a different queue
* Dequeue the task

You can also reorder experiments in a queue by dragging an experiment to a new position in the queue.
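
The effect of these actions on queue order can be pictured with a small in-memory model. This is a conceptual sketch only: it does not call ClearML, the function names are hypothetical, and it assumes that a task moved to another queue lands at the end of that queue. A queue is modeled as a plain list of task IDs, with index 0 being the next task to run.

```python
from typing import Dict, List


def move_to_top(queue: List[str], task: str) -> None:
    """Make `task` the next one to run."""
    queue.remove(task)
    queue.insert(0, task)


def move_to_bottom(queue: List[str], task: str) -> None:
    """Make `task` the last one to run."""
    queue.remove(task)
    queue.append(task)


def reorder(queue: List[str], task: str, new_index: int) -> None:
    """Model dragging `task` to a new position in the same queue."""
    queue.remove(task)
    queue.insert(new_index, task)


def move_to_queue(queues: Dict[str, List[str]], src: str, dst: str, task: str) -> None:
    """Move `task` from queue `src` to the end of queue `dst` (assumed placement)."""
    queues[src].remove(task)
    queues[dst].append(task)


def dequeue(queue: List[str], task: str) -> None:
    """Remove `task` from the queue without running it."""
    queue.remove(task)
```

Each function raises `ValueError` if the task is not in the source queue, which mirrors the fact that only enqueued experiments can be reorganized.
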