diff --git a/docs/cloud_autoscaling/autoscaling_overview.md b/docs/cloud_autoscaling/autoscaling_overview.md
new file mode 100644
index 00000000..e96b6e28
--- /dev/null
+++ b/docs/cloud_autoscaling/autoscaling_overview.md
@@ -0,0 +1,111 @@
+---
+title: Overview
+---
+
+Using [ClearML Agent](../clearml_agent.md) and [queues](../fundamentals/agents_and_queues.md#what-is-a-queue), you can
+easily run your code remotely on more powerful machines, including cloud instances.
+
+Manually spinning up new virtual machines and setting up ClearML Agents on them becomes a recurring chore as your
+workload grows. You also need to make sure you aren’t paying for running machines that sit unused, which can become
+pricey. This is where autoscaling comes into the picture.
+
+ClearML provides the following options to automate your resource scaling, while optimizing machine usage:
+* [ClearML autoscaler applications](#autoscaler-applications) - Use the apps to define your compute resource budget,
+and have the apps automatically manage your resource consumption as needed, with no code!
+* [Kubernetes integration](#kubernetes) - Deploy agents through Kubernetes, which handles resource management and scaling
+
+## Autoscaler Applications
+ClearML provides the following GUI autoscaler applications:
+* [GPU Compute](../webapp/applications/apps_gpu_compute.md) (powered by Genesis Cloud)
+* [AWS Autoscaler](../webapp/applications/apps_aws_autoscaler.md)
+* [GCP Autoscaler](../webapp/applications/apps_gcp_autoscaler.md)
+
+The autoscalers automatically spin cloud instances up or down as needed, according to a budget that you set, so you
+pay only for the time that you actually use the machines.
+
+The **AWS** and **GCP** autoscaler applications manage instances on your behalf in your cloud account. When
+launching an app instance, you provide your cloud service credentials so the autoscaler can access your account.
+
+The **GPU Compute** application provides on-demand GPU instances powered by Genesis Cloud. All you need to do is
+define your compute resource budget, and you’re good to go.
+
+## How ClearML Autoscaler Apps Work
+
+![Autoscaler diagram](../img/autoscaler_single_queue_diagram.png)
+
+The diagram above demonstrates a typical flow for executing tasks through an autoscaler app (an SDK sketch follows
+the list):
+1. [Create a queue](../webapp/webapp_workers_queues.md#queues) to attach the autoscaler to
+1. Set up an autoscaler app instance: assign it to a queue and define a compute resource budget (see the specific
+autoscaler pages for further setup details)
+1. Launch the autoscaler app instance
+1. Enqueue a task to the queue the autoscaler has been assigned to
+1. The autoscaler attached to the queue spins up and prepares a new compute resource to execute the enqueued task
+1. Enqueue additional tasks: if there are not enough machines to execute them, the autoscaler spins up additional
+machines (until the maximum number specified in the budget is reached)
+1. When a machine becomes idle because there are no pending tasks, the autoscaler automatically spins it down
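+
+The enqueue steps above can also be driven programmatically through the ClearML SDK. Below is a minimal sketch,
+assuming an existing task to serve as a template; the project, task, and queue names are placeholders:
+
+```python
+from clearml import Task
+
+# Fetch an existing task to use as a template for the remote run
+template = Task.get_task(project_name="examples", task_name="training template")
+
+# Clone it and enqueue the clone to the queue the autoscaler is monitoring;
+# the autoscaler will spin up a cloud instance to execute it if none is available
+cloned = Task.clone(source_task=template, name="training run")
+Task.enqueue(cloned, queue_name="autoscaler_queue")
+```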
+
+### Utilizing Multiple Compute Resource Types
+
+You can work with multiple compute resources through the autoscalers, where each compute resource is associated with a
+different queue. When a task is enqueued to one of the queues, the autoscaler spins up the appropriate resource to
+execute the task.
+
+![Autoscaler diagram](../img/autoscaler_diagram.png)
+
+The diagram above demonstrates an example where an autoscaler app instance is attached to two queues. Each queue is
+associated with a different resource, CPU and GPU, and each queue has two enqueued tasks. To execute the tasks,
+the autoscaler spins up four machines: two CPU machines to execute the tasks in the CPU queue, and two GPU machines to
+execute the tasks in the GPU queue.
+
+:::note
+The GPU Compute app spins up a single compute resource, so you can launch multiple app instances in order to work with
+multiple resources.
+:::
+
+### Task Execution Configuration
+
+#### Docker
+Every task a cloud instance pulls is run inside a Docker container. When setting up an autoscaler app instance,
+you can specify a default container to run the tasks inside. If a task has its own container configured, it
+overrides the autoscaler’s default Docker image (see [Base Docker Image](../clearml_agent.md#base-docker-container)).
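+
+For instance, a task can set its own container through the SDK. A minimal sketch, assuming a recent `clearml` version
+(the `docker_image` keyword name follows recent releases); the image and queue names are placeholders:
+
+```python
+from clearml import Task
+
+task = Task.init(project_name="examples", task_name="gpu training")
+
+# This image overrides the autoscaler's default container when an agent runs the task
+task.set_base_docker(docker_image="nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04")
+
+# Stop local execution and enqueue the task to the autoscaler's queue
+task.execute_remotely(queue_name="autoscaler_queue")
+```
+
+The container can also be set per task in the web UI, under the task’s **Execution** tab.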
+
+#### Git Configuration
+If your code is saved in a private repository, you can add your Git credentials so that the ClearML Agents running on
+your cloud instances will be able to retrieve the code from your repositories.
+
+#### Cloud Storage Access
+If your tasks need to access data stored in cloud storage, you can provide your cloud storage credentials, so the
+executed tasks will have access to your storage service.
+
+#### Additional Configuration
+
+Go to a specific app’s documentation page to view all configuration options:
+* [GPU Compute](../webapp/applications/apps_gpu_compute.md)
+* [AWS Autoscaler](../webapp/applications/apps_aws_autoscaler.md)
+* [GCP Autoscaler](../webapp/applications/apps_gcp_autoscaler.md)
+
+## Kubernetes
+ClearML offers an option to install `clearml-agent` through a Helm chart.
+
+The ClearML Agent deployment is set to service one or more queues. When a task is added to a queue, the agent pulls it
+and creates a pod to execute it. Kubernetes handles resource management: your task’s pod will remain pending until
+enough resources are available.
+
+You can set up the Kubernetes cluster autoscaler to work with your cloud provider, which automatically adjusts the
+size of your Kubernetes cluster as needed: adding nodes when there aren’t enough to run pending pods, and removing
+underutilized nodes. See [charts](https://github.com/kubernetes/autoscaler/tree/master/charts) for specific cloud providers.
+
+:::note Enterprise features
+The ClearML Enterprise plan supports Kubernetes servicing multiple ClearML queues, as well as providing a pod template
+for each queue that describes the resources for each pod to use. See [ClearML Helm Charts](https://github.com/allegroai/clearml-helm-charts/tree/main).
+:::
diff --git a/docs/img/autoscaler_diagram.png b/docs/img/autoscaler_diagram.png
new file mode 100644
index 00000000..4bee2cdc
Binary files /dev/null and b/docs/img/autoscaler_diagram.png differ
diff --git a/docs/img/autoscaler_single_queue_diagram.png b/docs/img/autoscaler_single_queue_diagram.png
new file mode 100644
index 00000000..04770279
Binary files /dev/null and b/docs/img/autoscaler_single_queue_diagram.png differ
diff --git a/sidebars.js b/sidebars.js
index cc2bef2c..e78313f2 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -37,6 +37,16 @@ module.exports = {
         'fundamentals/hpo']},
     {'ClearML SDK': ['clearml_sdk/clearml_sdk', 'clearml_sdk/task_sdk', 'clearml_sdk/model_sdk', 'clearml_sdk/apiclient_sdk']},
     'clearml_agent',
+    {'Cloud Autoscaling': [
+        'cloud_autoscaling/autoscaling_overview',
+        {'Autoscaler Apps': [
+            {type: 'ref', id: 'webapp/applications/apps_gpu_compute'},
+            {type: 'ref', id: 'webapp/applications/apps_aws_autoscaler'},
+            {type: 'ref', id: 'webapp/applications/apps_gcp_autoscaler'},
+        ]
+        }
+    ]
+    },
     {'ClearML Pipelines':['pipelines/pipelines',
         {"Building Pipelines": ['pipelines/pipelines_sdk_tasks', 'pipelines/pipelines_sdk_function_decorators']