clearml-docs/agent_k8s.md at 60be54d54be808b36fa1812f8e9ff100e2c9051d

mirror of https://github.com/clearml/clearml-docs synced 2025-06-26 18:17:44 +00:00

revital aaa3851de3 Edits

2025-05-15 14:55:35 +03:00

ClearML Agent on Kubernetes

The ClearML Agent enables scheduling and executing distributed experiments on a Kubernetes cluster.

Prerequisites

A running ClearML Enterprise Server
API credentials (<ACCESS_KEY> and <SECRET_KEY>) generated via the ClearML UI (Settings > Workspace > API Credentials > Create new credentials). For more information, see ClearML API Credentials.

:::note
Make sure these credentials belong to an admin user or a service user with admin privileges.
:::
The worker environment must be able to access the ClearML Server over the same network.
A Helm token for accessing the clearml-enterprise Helm chart repository

Installation

Add the Helm Repo Locally

Add the ClearML Helm repository:

helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>

Update the repository locally:

helm repo update

Create a Values Override File

Create a clearml-agent-values.override.yaml file with the following content:

:::note
Replace <ACCESS_KEY> and <SECRET_KEY> with the API credentials you generated earlier. Set the <api|file|web>ServerUrlReference fields to match your ClearML Server URLs.
:::

imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
  agentk8sglueKey: "<ACCESS_KEY>"
  agentk8sglueSecret: "<SECRET_KEY>"
agentk8sglue:
  apiServerUrlReference: "<CLEARML_API_SERVER_REFERENCE>"
  fileServerUrlReference: "<CLEARML_FILE_SERVER_REFERENCE>"
  webServerUrlReference: "<CLEARML_WEB_SERVER_REFERENCE>"
  createQueues: true
  queues:
    exampleQueue:
      templateOverrides: {}
      queueSettings: {}
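
For illustration, a filled-in override file might look like the sketch below. Every hostname, key, and token in it is a made-up placeholder (the clearml.example.com domain is hypothetical), and the URL layout is an assumption; use the addresses and credentials of your own ClearML Server.

```yaml
# Hypothetical filled-in values: every hostname, key, and token here is invented for this example.
imageCredentials:
  password: "example-dockerhub-token"
clearml:
  agentk8sglueKey: "EXAMPLEACCESSKEY"
  agentk8sglueSecret: "example-secret-key"
agentk8sglue:
  # Assumed URL layout; replace with the URLs your server is actually reachable at
  apiServerUrlReference: "https://api.clearml.example.com"
  fileServerUrlReference: "https://files.clearml.example.com"
  webServerUrlReference: "https://app.clearml.example.com"
  createQueues: true
  queues:
    exampleQueue:
      templateOverrides: {}
      queueSettings: {}
```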

Install the Chart

Install the ClearML Enterprise Agent Helm chart:

helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml

Additional Configuration Options

To view available configuration options for the Helm chart, run the following command:

helm show readme clearml-enterprise/clearml-enterprise-agent
# or
helm show values clearml-enterprise/clearml-enterprise-agent

Reporting GPU Availability to Orchestration Dashboard

To show GPU availability in the Orchestration Dashboard, explicitly set the number of GPUs:

agentk8sglue:
  # -- Maximum number of available GPUs the agent reports to the dashboard. This example reports 2 GPUs.
  dashboardReportMaxGpu: 2

Queues

The ClearML Agent monitors ClearML queues and pulls tasks that are scheduled for execution.

A single agent can monitor multiple queues. By default, all queues share a base pod template (agentk8sglue.basePodTemplate), which is used to launch a task on Kubernetes after it has been pulled from the queue.

Each queue can override the base pod template with its own settings by defining a templateOverrides section. This way, queue definitions can be tailored to different use cases.

The following are a few examples of agent queue templates:

Example: GPU Queues

To support GPU queues, you must deploy the NVIDIA GPU Operator on your Kubernetes cluster. For more information, see GPU Operator.

agentk8sglue:
  createQueues: true
  queues:
    1xGPU:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
    2xGPU:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2
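
A GPU limit can presumably sit alongside other resource limits in the same templateOverrides block, since the memory limit in the custom pod template example uses the same resources.limits key. The following is a minimal sketch under that assumption; the queue name 1xGPU-16Gi and the 16Gi value are hypothetical.

```yaml
agentk8sglue:
  createQueues: true
  queues:
    # Hypothetical queue name, chosen for this example only
    1xGPU-16Gi:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 16Gi   # hypothetical value
```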

Example: Custom Pod Template per Queue

This example demonstrates how to override the base pod template definitions on a per-queue basis. In this example:

The red queue inherits both the label team=red and the 1Gi memory limit from the basePodTemplate section.
The blue queue overrides the label by setting it to team=blue, and inherits the 1Gi memory limit from the basePodTemplate section.
The green queue overrides the label by setting it to team=green, and overrides the memory limit by setting it to 2Gi. It also sets an annotation and an environment variable.

agentk8sglue:
  # Defines common template
  basePodTemplate:
    labels:
      team: red
    resources:
      limits:
        memory: 1Gi
  createQueues: true
  queues:
    red:
      # Does not override
      templateOverrides: {}
    blue:
      # Overrides labels
      templateOverrides:
        labels:
          team: blue
    green:
      # Overrides labels and resources, and sets new fields
      templateOverrides:
        labels:
          team: green
        annotations:
          example: "example value"
        resources:
          limits:
            memory: 2Gi
        env:
          - name: MY_ENV
            value: "my_value"
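
To make the inheritance concrete, the sketch below lists the settings each queue's task pods should end up with, following the rules described for the red, blue, and green queues. It is an illustrative summary only, not a file the chart reads or produces, and its top-level layout is invented for this example.

```yaml
# Illustrative summary of the effective per-queue settings (not chart input or output)
red:
  labels:
    team: red            # inherited from basePodTemplate
  resources:
    limits:
      memory: 1Gi        # inherited from basePodTemplate
blue:
  labels:
    team: blue           # overridden by the queue
  resources:
    limits:
      memory: 1Gi        # inherited from basePodTemplate
green:
  labels:
    team: green          # overridden by the queue
  annotations:
    example: "example value"   # set only by the queue
  resources:
    limits:
      memory: 2Gi        # overridden by the queue
  env:
    - name: MY_ENV       # set only by the queue
      value: "my_value"
```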

Next Steps

Once the agent is up and running, proceed with deploying the ClearML Enterprise App Gateway.
