clearml-docs/docs/deploying_clearml/enterprise_deploy/agent_k8s.md
2025-05-13 13:33:57 +03:00

🟡 Ready, but missing hyperlinks (see TODOs)

TODO:

  • Link: GPU Operator
  • Link: Additional configurations
  • Link: Now proceed with AI App Gateway

title: ClearML Agent on Kubernetes

The ClearML Agent allows scheduling distributed experiments on a Kubernetes cluster.

Prerequisites

  • The ClearML Enterprise server is up and running.

  • Create a set of <ACCESS_KEY> and <SECRET_KEY> credentials in the ClearML Server. The easiest way to do so is from the ClearML UI (Settings > Workspace > App Credentials > Create new credentials).

    :::note Make sure that the generated keys belong to an admin user or a service user with admin privileges. :::

  • The worker environment should be able to communicate with the ClearML Server over the same network.

Installation

Add the Helm Repo Locally

Add the ClearML Helm repository:

helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>

Update the repository locally:

helm repo update

Prepare Values

Create a clearml-agent-values.override.yaml file with the following content:

:::note In the following configuration, replace the <ACCESS_KEY> and <SECRET_KEY> placeholders with the admin credentials you have generated on the ClearML Server. The values for <api|file|web>ServerUrlReference should point to your ClearML control-plane installation. :::

imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
  agentk8sglueKey: "<ACCESS_KEY>"
  agentk8sglueSecret: "<SECRET_KEY>"
agentk8sglue:
  apiServerUrlReference: "<CLEARML_API_SERVER_REFERENCE>"
  fileServerUrlReference: "<CLEARML_FILE_SERVER_REFERENCE>"
  webServerUrlReference: "<CLEARML_WEB_SERVER_REFERENCE>"
  createQueues: true
  queues:
    exampleQueue:
      templateOverrides: {}
      queueSettings: {}
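
As a rough illustration, a filled-in override file might look like the following. The domain clearml.example.com, the api/files/app subdomain layout, and the queue name default are assumptions made for this example, not values taken from your installation. The credentials remain placeholders.

```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
  agentk8sglueKey: "<ACCESS_KEY>"
  agentk8sglueSecret: "<SECRET_KEY>"
agentk8sglue:
  # Hypothetical URLs: replace with the addresses of your own control plane
  apiServerUrlReference: "https://api.clearml.example.com"
  fileServerUrlReference: "https://files.clearml.example.com"
  webServerUrlReference: "https://app.clearml.example.com"
  createQueues: true
  queues:
    # "default" is an illustrative queue name
    default:
      templateOverrides: {}
      queueSettings: {}
```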

Install the Chart

Install the ClearML Enterprise Agent Helm chart using the values override file you created in the previous step:

helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml

Additional Configuration Options

:::note You can view the full set of documented chart values by running either of the following commands:

helm show readme clearml-enterprise/clearml-enterprise-agent
# or
helm show values clearml-enterprise/clearml-enterprise-agent

:::

Report GPUs in the Dashboard

For GPUs to appear in the dashboard reporting, the Agent must explicitly report the total number of GPUs available on the cluster:

agentk8sglue:
  # -- Agent reporting to Dashboard max GPU available. This will report 2 GPUs.
  dashboardReportMaxGpu: 2

Queues

The ClearML Agent in Kubernetes monitors ClearML queues and pulls tasks that are scheduled for execution.

A single agent can monitor multiple queues. All queues share a base Pod template (agentk8sglue.basePodTemplate), which is used when a task is pulled from a queue and submitted to Kubernetes.

Each queue can be configured with a dedicated Pod template spec override (templateOverrides). This way, queue definitions can be mixed and matched to serve multiple use cases.

The following are a few examples of agent queue templates.

GPU Queues

:::note Configuring and using GPU queues with the ClearML Enterprise Agent requires the NVIDIA GPU Operator to be deployed on your Kubernetes cluster. For more information, refer to the GPU Operator page. :::

agentk8sglue:
  createQueues: true
  queues:
    1xGPU:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
    2xGPU:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2
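
The GPU queues can be combined with the dashboard reporting setting in a single override file. The sketch below uses only keys shown on this page. It assumes a cluster with 4 GPUs in total and adds a hypothetical CPU-only queue named cpu, so adjust both to your environment.

```yaml
agentk8sglue:
  # Assumed total for this example: the cluster has 4 GPUs
  dashboardReportMaxGpu: 4
  createQueues: true
  queues:
    # Hypothetical CPU-only queue: no GPU limit is requested
    cpu:
      templateOverrides: {}
    1xGPU:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
    2xGPU:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2
```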

Override a Pod Template by Queue

In the following example:

  • The red queue will inherit both the label team=red and the 1Gi memory limit from the basePodTemplate section.
  • The blue queue will set the label team=blue, but will inherit the 1Gi memory limit from the basePodTemplate section.
  • The green queue will set both the label team=green and a 2Gi memory limit. It will also set an annotation and an environment variable.

agentk8sglue:
  # Defines common template
  basePodTemplate:
    labels:
      team: red
    resources:
      limits:
        memory: 1Gi
  createQueues: true
  queues:
    red:
      # Does not override
      templateOverrides: {}
    blue:
      # Overrides labels
      templateOverrides:
        labels:
          team: blue
    green:
      # Overrides labels and resources, and sets new fields
      templateOverrides:
        labels:
          team: green
        annotations:
          example: "example value"
        resources:
          limits:
            memory: 2Gi
        env:
          - name: MY_ENV
            value: "my_value"
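
To make the inheritance concrete, the effective settings for each queue, as described in the list at the start of this section, can be written out as follows. This is an illustration only: it is not a file the chart reads or generates, and its layout is an assumption chosen for readability.

```yaml
# Illustration only: effective Pod template settings per queue
red:
  labels:
    team: red            # inherited from basePodTemplate
  resources:
    limits:
      memory: 1Gi        # inherited from basePodTemplate
blue:
  labels:
    team: blue           # overridden by the queue
  resources:
    limits:
      memory: 1Gi        # inherited from basePodTemplate
green:
  labels:
    team: green          # overridden by the queue
  annotations:
    example: "example value"   # added by the queue
  resources:
    limits:
      memory: 2Gi        # overridden by the queue
  env:
    - name: MY_ENV       # added by the queue
      value: "my_value"
```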

Next Steps

Once the ClearML Enterprise Agent is up and running, proceed with deploying the ClearML Enterprise App Gateway.

TODO link to the AI App Gateway page in documentation