Add Cloud Autoscaler section (#523)

2025-06-26 18:17:44 +00:00 · 2023-04-04 16:00:39 +03:00
parent be3cf843ed
commit 3bc4357092
4 changed files with 121 additions and 0 deletions
--- a/docs/cloud_autoscaling/autoscaling_overview.md
+++ b/docs/cloud_autoscaling/autoscaling_overview.md
@@ -0,0 +1,111 @@
+---
+title: Overview
+---
+
+<div style={{position: 'relative', overflow: 'hidden', width: '100%', paddingTop: '56.25%' }} >
+<iframe style={{position: 'absolute', top: '0', left: '0', bottom: '0', right: '0', width: '100%', height: '100%'}} 
+        src="https://www.youtube.com/embed/j4XVMAaUt3E" 
+        title="YouTube video player" 
+        frameborder="0" 
+        allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; fullscreen" 
+        allowfullscreen>
+</iframe>
+</div>
+
+<br/>
+
+Using [ClearML Agent](../clearml_agent.md) and [queues](../fundamentals/agents_and_queues.md#what-is-a-queue), you can 
+easily run your code remotely on more powerful machines, including cloud instances. 
+
+Manually spinning up new virtual machines and setting up ClearML Agents on them becomes a recurring chore as your 
+workload grows, and machines that are left running while idle can become costly. 
+This is where autoscaling comes into the picture. 
+
+ClearML provides the following options to automate your resource scaling, while optimizing machine usage:
+* [ClearML autoscaler applications](#autoscaler-applications) - Use the apps to define your compute resource budget, 
+and have the apps automatically manage your resource consumption as needed, with no code!
+* [Kubernetes integration](#kubernetes) - Deploy agents through Kubernetes, which handles resource management and scaling.
+
+## Autoscaler Applications
+ClearML provides the following GUI autoscaler applications:
+* [GPU Compute](../webapp/applications/apps_gpu_compute.md) (powered by Genesis Cloud)
+* [AWS Autoscaler](../webapp/applications/apps_aws_autoscaler.md)
+* [GCP Autoscaler](../webapp/applications/apps_gcp_autoscaler.md)
+
+The autoscalers automatically spin cloud instances up or down as needed, according to a budget that you set, so you 
+pay only for the time that you actually use the machines. 
+
+The **AWS** and **GCP** autoscaler applications will manage instances on your behalf in your cloud account. When 
+launching an app instance, you will provide your cloud service credentials so the autoscaler can access your account.  
+
+The **GPU Compute** application provides on-demand GPU instances powered by Genesis Cloud. All you need to do is define your compute resource budget, and you’re good to go. 
+
+## How ClearML Autoscaler Apps Work 
+
+![Autoscaler diagram](../img/autoscaler_single_queue_diagram.png)
+
+The diagram above demonstrates a typical flow for executing tasks through an autoscaler app:
+1. [Create a queue](../webapp/webapp_workers_queues.md#queues) to attach the autoscaler to 
+1. Set up an autoscaler app instance: assign it to a queue and define a compute resource budget (see the specific 
+autoscaler pages for further setup details)
+1. Launch the autoscaler app instance
+1. Enqueue a task to the queue the autoscaler has been assigned to
+1. The autoscaler attached to the queue spins up and prepares a new compute resource to execute the enqueued task
+1. Enqueue additional tasks: if there are not enough machines to execute the tasks, the autoscaler spins up additional 
+machines to execute the tasks (until the maximum number specified in the budget is reached)
+1. If a machine becomes idle because there are no tasks to execute, the autoscaler automatically spins it down
+
+### Utilizing Multiple Compute Resource Types
+
+You can work with multiple compute resource types through the autoscalers, where each resource type is associated with 
+a different queue. When a task is enqueued to one of these queues, the autoscaler spins up the resource associated with that queue to execute the task. 
+
+![Autoscaler diagram](../img/autoscaler_diagram.png)
+
+The diagram above demonstrates an example where an autoscaler app instance is attached to two queues. Each queue is 
+associated with a different resource type (one with CPU machines, the other with GPU machines), and each queue has two 
+enqueued tasks. To execute the tasks, the autoscaler spins up four machines: two CPU machines for the tasks in the CPU 
+queue and two GPU machines for the tasks in the GPU queue.
+
+:::note
+The GPU Compute app spins up a single compute resource, so you can launch multiple app instances in order to work with 
+multiple resources.
+:::
+
+### Task Execution Configuration
+
+#### Docker
+Every task that a cloud instance pulls is run inside a Docker container. When setting up an autoscaler app instance, 
+you can specify a default container in which to run the tasks. If a task has its own container configured, that 
+container overrides the autoscaler’s default Docker image (see [Base Docker Image](../clearml_agent.md#base-docker-container)).
+
+#### Git Configuration 
+If your code is saved in a private repository, you can add your Git credentials so the ClearML Agents running on your
+cloud instances will be able to retrieve the code from your repos.
+
+#### Cloud Storage Access
+If your tasks need to access data stored in cloud storage, you can provide your cloud storage credentials, so the 
+executed tasks will have access to your storage service. 
+
+#### Additional Configuration
+
+Go to a specific app’s documentation page to view all of its configuration options:
+* [GPU Compute](../webapp/applications/apps_gpu_compute.md)
+* [AWS Autoscaler](../webapp/applications/apps_aws_autoscaler.md)
+* [GCP Autoscaler](../webapp/applications/apps_gcp_autoscaler.md) 
+
+## Kubernetes 
+ClearML offers an option to install `clearml-agent` through a Helm chart. 
+ 
+The ClearML Agent deployment is set to service a queue (or queues). When a task is added to a serviced queue, the agent 
+pulls the task and creates a pod to execute it. Kubernetes handles resource management: your task pod remains pending 
+until enough resources are available.
+ 
+You can set up the Kubernetes cluster autoscaler to work with your cloud provider. It automatically adjusts the size of 
+your Kubernetes cluster as needed, increasing the number of nodes when there aren't enough to execute pods and removing 
+underutilized nodes. See [charts](https://github.com/kubernetes/autoscaler/tree/master/charts) for specific cloud providers.
+
+:::note Enterprise features
+The ClearML Enterprise plan supports K8S servicing multiple ClearML queues, as well as providing a pod template for each 
+queue for describing the resources for each pod to use. See [ClearML Helm Charts](https://github.com/allegroai/clearml-helm-charts/tree/main).  
+:::
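The scaling flow described in `autoscaling_overview.md` above (spin up machines for enqueued tasks until the budget is reached, spin down machines that sit idle) can be sketched as a small per-queue decision function. This is only an illustrative model, not ClearML's actual autoscaler logic: `QueueState`, `scaling_decision`, and the queue names are hypothetical, and the sketch ignores idle timeouts and instance startup time.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    """Snapshot of one queue and the machines serving it (hypothetical model)."""
    pending_tasks: int   # tasks waiting in the queue
    busy_machines: int   # machines currently executing a task
    idle_machines: int   # machines running with nothing to execute
    max_machines: int    # budget: upper bound on machines for this queue


def scaling_decision(state: QueueState) -> int:
    """Return how many machines to add (positive) or remove (negative)."""
    # Idle machines can pick up pending tasks before anything new is started.
    unserved = max(0, state.pending_tasks - state.idle_machines)
    if unserved > 0:
        running = state.busy_machines + state.idle_machines
        headroom = max(0, state.max_machines - running)
        # Never exceed the budget, even if more tasks are waiting.
        return min(unserved, headroom)
    # No backlog: remove idle machines that no pending task will use.
    return -(state.idle_machines - state.pending_tasks)


# Two queues, each tied to a different resource type and budget (assumed values).
queues = {
    "cpu_queue": QueueState(pending_tasks=2, busy_machines=0, idle_machines=0, max_machines=3),
    "gpu_queue": QueueState(pending_tasks=2, busy_machines=0, idle_machines=0, max_machines=1),
}
decisions = {name: scaling_decision(state) for name, state in queues.items()}
print(decisions)
```

In this model each queue is evaluated independently, which mirrors the multi-queue diagram: the budget of one resource type does not limit the other.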
--- a/docs/img/autoscaler_diagram.png
+++ b/docs/img/autoscaler_diagram.png
--- a/docs/img/autoscaler_single_queue_diagram.png
+++ b/docs/img/autoscaler_single_queue_diagram.png
--- a/sidebars.js
+++ b/sidebars.js
@@ -37,6 +37,16 @@ module.exports = {
            'fundamentals/hpo']},
        {'ClearML SDK': ['clearml_sdk/clearml_sdk', 'clearml_sdk/task_sdk', 'clearml_sdk/model_sdk', 'clearml_sdk/apiclient_sdk']},
        'clearml_agent',
+        {'Cloud Autoscaling': [
+            'cloud_autoscaling/autoscaling_overview',
+             {'Autoscaler Apps': [
+                    {type: 'ref', id: 'webapp/applications/apps_gpu_compute'},
+                    {type: 'ref', id: 'webapp/applications/apps_aws_autoscaler'},
+                    {type: 'ref', id: 'webapp/applications/apps_gcp_autoscaler'},
+                 ]
+             }
+             ]
+        },
        {'ClearML Pipelines':['pipelines/pipelines',
                {"Building Pipelines":
                        ['pipelines/pipelines_sdk_tasks', 'pipelines/pipelines_sdk_function_decorators']