clearml-docs/scaling_resources.md at d617703281a8e3b107894c916582ee7580bc6b14

mirror of https://github.com/clearml/clearml-docs synced 2025-04-22 07:15:59 +00:00

revital d617703281 Add scaling usecase

2025-03-26 14:02:07 +02:00

title: Autoscaling Resources

ClearML provides several options for automating resource scaling while optimizing machine usage. Autoscaling lets you dynamically manage compute resources based on demand, improving efficiency and reducing cost.

When running machine learning experiments or large-scale compute tasks, demand for resources fluctuates. Autoscaling ensures that:

Resources are available when needed, preventing delays in task execution.
Idle resources are automatically spun down, reducing unnecessary costs.
Workloads can be distributed efficiently.
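The policy described above can be sketched as a single decision function. This is only an illustrative model, not ClearML code: the `Instance` type, the `plan_scaling` function, and the parameters `max_instances` and `max_idle_minutes` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """A running compute instance (hypothetical model, not a ClearML type)."""
    name: str
    idle_minutes: float  # minutes since the last task finished; 0 means busy


def plan_scaling(queued_tasks, instances, max_instances, max_idle_minutes):
    """Return (number of instances to start, names of instances to stop)."""
    idle = [i for i in instances if i.idle_minutes > 0]

    # Idle instances pick up queued tasks first; only the surplus may be stopped.
    surplus_count = max(0, len(idle) - queued_tasks)
    longest_idle_first = sorted(idle, key=lambda i: i.idle_minutes, reverse=True)
    surplus = longest_idle_first[:surplus_count]
    to_stop = [i.name for i in surplus if i.idle_minutes > max_idle_minutes]

    # Tasks that no idle instance can absorb need new instances,
    # but the total never exceeds the configured budget.
    unserved = max(0, queued_tasks - len(idle))
    headroom = max(0, max_instances - len(instances))
    return min(unserved, headroom), to_stop
```

With no queued tasks, only instances that have been idle longer than the limit are stopped. With a backlog, no idle instance is stopped and new instances are started up to the budget.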

ClearML offers the following resource autoscaling solutions:

GUI applications (available under the Pro and Enterprise plans) - Use the built-in apps to define your compute resource budget, and have the apps automatically manage your resource consumption as needed, with no code!
- AWS Autoscaler
- GCP Autoscaler
Kubernetes integration - Deploy agents in Kubernetes for automated resource allocation and scaling
Custom autoscaler implementation using the AutoScaler class

GUI Autoscaler Applications

For users on Pro and Enterprise plans, ClearML provides UI applications for configuring autoscaling of cloud resources. These applications include:

AWS Autoscaler: Automatically provisions and shuts down AWS EC2 instances based on workload demand.
GCP Autoscaler: Manages Google Cloud instances dynamically according to defined budgets.

These applications allow users to set up autoscaling with minimal configuration, defining compute budgets and resource limits directly through the UI.

Kubernetes Integration

You can install clearml-agent through a Helm chart.

ClearML integrates with Kubernetes, allowing agents to be deployed within a cluster. Kubernetes handles:

Automatic pod creation for executing tasks.
Resource allocation and scaling based on workload.
Optional integration with Kubernetes' cluster autoscaler, which adjusts the number of nodes dynamically.

The ClearML Agent deployment is set to service one or more queues. When tasks are added to those queues, the agent pulls each task and creates a pod to execute it. Kubernetes handles resource management: a task pod remains pending until enough resources are available.

You can set up Kubernetes' cluster autoscaler to work with your cloud provider. It automatically adjusts the size of your Kubernetes cluster as needed, adding nodes when there aren't enough to execute pending pods and removing underutilized nodes. See charts for specific cloud providers.
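To make the pending-pod behavior concrete, here is a toy model of the two ideas above: pods wait until a node has room, and a cluster autoscaler adds nodes for the pods that could not be placed. It is a deliberately simplified sketch that assumes GPUs are the only resource and uses first-fit placement. The real Kubernetes scheduler and cluster autoscaler consider many more factors, and the function names `schedule` and `nodes_to_add` are hypothetical.

```python
def schedule(pod_gpu_requests, node_free_gpus):
    """First-fit placement. Returns (pod index -> node index, pending pod indices)."""
    free = list(node_free_gpus)  # copy so the caller's list is not modified
    placements, pending = {}, []
    for pod, request in enumerate(pod_gpu_requests):
        for node, available in enumerate(free):
            if request <= available:
                free[node] -= request
                placements[pod] = node
                break
        else:
            # No node has enough free GPUs: the pod stays pending.
            pending.append(pod)
    return placements, pending


def nodes_to_add(pending_requests, gpus_per_new_node):
    """Count the new nodes needed to fit the pending pods, using first-fit."""
    new_nodes = []  # free GPUs left on each node that would be added
    for request in pending_requests:
        if request > gpus_per_new_node:
            raise ValueError("pod can never fit on this node type")
        for idx, available in enumerate(new_nodes):
            if request <= available:
                new_nodes[idx] -= request
                break
        else:
            new_nodes.append(gpus_per_new_node - request)
    return len(new_nodes)
```

For example, with two nodes that have 4 and 2 free GPUs, pods requesting 2, 4, 1, and 4 GPUs leave the two 4-GPU pods pending, and two additional 4-GPU nodes would be needed to run them.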

For more information, see ClearML Kubernetes Agent.

:::note Enterprise features
The ClearML Enterprise plan supports a Kubernetes agent servicing multiple ClearML queues, and lets you provide a pod template for each queue that describes the resources each pod uses. See ClearML Helm Charts.
:::

Custom Autoscaler Implementation

Users can build their own autoscaler using the clearml.automation.auto_scaler.AutoScaler class, which enables:

Direct control over instance scaling logic.
Custom rules for resource allocation.

An AutoScaler instance monitors ClearML task queues and dynamically adjusts the number of cloud instances based on workload demand. By integrating with a CloudDriver, it supports multiple cloud providers like AWS and GCP.
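The separation between scaling logic and a cloud driver can be illustrated with a small, self-contained sketch. It does not use the actual `AutoScaler` or `CloudDriver` classes, and none of the names below (`Driver`, `InMemoryDriver`, `supervise_once`, or the shape of the `queue_resources` mapping) are taken from the ClearML API; they are assumptions made for illustration only.

```python
from typing import Dict, List, Protocol, Tuple


class Driver(Protocol):
    """Hypothetical cloud driver interface (not ClearML's CloudDriver API)."""

    def spin_up(self, resource: str) -> str: ...

    def running(self, resource: str) -> int: ...


class InMemoryDriver:
    """Stand-in driver that records instances instead of calling a cloud API."""

    def __init__(self) -> None:
        self.instances: Dict[str, str] = {}  # instance id -> resource type
        self._counter = 0

    def spin_up(self, resource: str) -> str:
        self._counter += 1
        instance_id = f"{resource}-{self._counter}"
        self.instances[instance_id] = resource
        return instance_id

    def running(self, resource: str) -> int:
        return sum(1 for r in self.instances.values() if r == resource)


def supervise_once(
    driver: Driver,
    queue_depths: Dict[str, int],
    queue_resources: Dict[str, List[Tuple[str, int]]],
) -> List[str]:
    """One monitoring pass: start instances for queued tasks, within limits.

    queue_depths: queue name -> number of tasks waiting in that queue
    queue_resources: queue name -> list of (resource type, max instances)
    """
    started = []
    for queue, pending in queue_depths.items():
        for resource, limit in queue_resources.get(queue, []):
            running = driver.running(resource)
            # A task still waiting in the queue is assumed to have no free worker.
            while pending > 0 and running < limit:
                started.append(driver.spin_up(resource))
                running += 1
                pending -= 1
    return started
```

Because the loop only talks to the `Driver` interface, swapping the in-memory stand-in for a driver that calls a real cloud provider would not change the scaling rules themselves.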

See the AWS Autoscaler Example for a practical implementation using the AutoScaler class. The script can be adapted for GCP autoscaling as well.
