ClearML Agent on Kubernetes
The ClearML Agent enables scheduling and executing distributed experiments on a Kubernetes cluster.
Prerequisites
- A running ClearML Enterprise Server
- API credentials (<ACCESS_KEY> and <SECRET_KEY>) generated via the ClearML UI (Settings > Workspace > API Credentials > Create new credentials). For more information, see ClearML API Credentials.

  :::note
  Make sure these credentials belong to an admin user or a service account with admin privileges.
  :::
- The worker environment must be able to access the ClearML Server over the same network.
- Helm token to access the clearml-enterprise Helm chart repo
- To support GPU queues, you must deploy the NVIDIA GPU Operator on your Kubernetes cluster. For more information, see GPU Operator.
Installation
Add the Helm Repo Locally
Add the ClearML Helm repository:
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
Update the local repository:
helm repo update
Create a Values Override File
Create a clearml-agent-values.override.yaml file with the following content:
:::note
Replace <ACCESS_KEY> and <SECRET_KEY> with the API credentials you generated earlier.
Set the <api|file|web>ServerUrlReference fields to match your ClearML Server URLs.
:::
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
  agentk8sglueKey: "<ACCESS_KEY>"
  agentk8sglueSecret: "<SECRET_KEY>"
agentk8sglue:
  apiServerUrlReference: "<CLEARML_API_SERVER_REFERENCE>"
  fileServerUrlReference: "<CLEARML_FILE_SERVER_REFERENCE>"
  webServerUrlReference: "<CLEARML_WEB_SERVER_REFERENCE>"
  createQueues: true
  queues:
    exampleQueue:
      templateOverrides: {}
      queueSettings: {}
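As a convenience, the override file above could also be rendered from a script so that credentials are not typed in by hand. The following is a minimal sketch, not part of the chart: the environment variable names and the `render_values` helper are hypothetical, and unset variables simply fall back to the placeholders used in this guide. Values are quoted with `json.dumps`, which produces double-quoted strings that are also valid YAML scalars.

```python
import json
import os

# Same structure as the override file shown above; {{}} renders as a literal {}.
VALUES_TEMPLATE = """\
imageCredentials:
  password: {dockerhub_token}
clearml:
  agentk8sglueKey: {access_key}
  agentk8sglueSecret: {secret_key}
agentk8sglue:
  apiServerUrlReference: {api_url}
  fileServerUrlReference: {file_url}
  webServerUrlReference: {web_url}
  createQueues: true
  queues:
    exampleQueue:
      templateOverrides: {{}}
      queueSettings: {{}}
"""


def render_values(settings: dict) -> str:
    """Return the override file text with every setting safely double-quoted."""
    quoted = {name: json.dumps(value) for name, value in settings.items()}
    return VALUES_TEMPLATE.format(**quoted)


# Hypothetical environment variable names; unset ones keep the guide's placeholders.
settings = {
    "dockerhub_token": os.environ.get("CLEARML_DOCKERHUB_TOKEN", "<CLEARML_DOCKERHUB_TOKEN>"),
    "access_key": os.environ.get("CLEARML_ACCESS_KEY", "<ACCESS_KEY>"),
    "secret_key": os.environ.get("CLEARML_SECRET_KEY", "<SECRET_KEY>"),
    "api_url": os.environ.get("CLEARML_API_URL", "<CLEARML_API_SERVER_REFERENCE>"),
    "file_url": os.environ.get("CLEARML_FILE_URL", "<CLEARML_FILE_SERVER_REFERENCE>"),
    "web_url": os.environ.get("CLEARML_WEB_URL", "<CLEARML_WEB_SERVER_REFERENCE>"),
}

print(render_values(settings))
```

Redirecting the printed text into clearml-agent-values.override.yaml gives a file with the same structure as the one shown above.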
Install the Chart
Install the ClearML Enterprise Agent Helm chart:
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
Additional Configuration Options
To view the available configuration options for the Helm chart, run one of the following commands:
helm show readme clearml-enterprise/clearml-enterprise-agent
# or
helm show values clearml-enterprise/clearml-enterprise-agent
Reporting GPU Availability to Orchestration Dashboard
To show GPU availability in the Orchestration Dashboard, explicitly set the number of GPUs:
agentk8sglue:
  # -- Agent reporting to Dashboard max GPU available. This will report 2 GPUs.
  dashboardReportMaxGpu: 2
Queues
The ClearML Agent monitors ClearML queues and pulls tasks that are scheduled for execution.
A single agent can monitor multiple queues. By default, all queues share a base pod template (agentk8sglue.basePodTemplate), which is used to launch each task on Kubernetes after it has been pulled from its queue.
Each queue can override the base pod template with its own settings through a templateOverrides queue template.
This way, queue definitions can be tailored to different use cases.
The following are a few examples of agent queue templates:
Example: GPU Queues
To support GPU queues, you must deploy the NVIDIA GPU Operator on your Kubernetes cluster. For more information, see GPU Operator.
agentk8sglue:
  createQueues: true
  queues:
    1xGPU:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
    2xGPU:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2
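The two queues above follow a regular pattern: one queue per GPU count, each limiting `nvidia.com/gpu` to that count. If many such queues are needed, the block could be generated rather than written by hand. The sketch below is illustrative only; the `gpu_queues` helper is hypothetical, and the `<N>xGPU` naming simply mirrors the example above. It prints JSON, which is a subset of YAML 1.2, so the output should be readable by YAML tooling, though that is worth confirming against your own setup.

```python
import json


def gpu_queues(counts):
    """Build one queue definition per GPU count, mirroring the 1xGPU/2xGPU example."""
    queues = {}
    for count in counts:
        if not isinstance(count, int) or count < 1:
            raise ValueError(f"GPU count must be a positive integer, got {count!r}")
        queues[f"{count}xGPU"] = {
            "templateOverrides": {
                "resources": {"limits": {"nvidia.com/gpu": count}}
            }
        }
    return queues


values = {"agentk8sglue": {"createQueues": True, "queues": gpu_queues([1, 2])}}
print(json.dumps(values, indent=2))
```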
Example: Custom Pod Template per Queue
This example demonstrates how to override the base pod template definitions on a per-queue basis. In this example:
- The red queue inherits both the label team=red and the 1Gi memory limit from the basePodTemplate section.
- The blue queue overrides the label by setting it to team=blue, and inherits the 1Gi memory from the basePodTemplate section.
- The green queue overrides the label by setting it to team=green, and overrides the memory limit by setting it to 2Gi. It also sets an annotation and an environment variable.
agentk8sglue:
  # Defines common template
  basePodTemplate:
    labels:
      team: red
    resources:
      limits:
        memory: 1Gi
  createQueues: true
  queues:
    red:
      # Does not override
      templateOverrides: {}
    blue:
      # Overrides labels
      templateOverrides:
        labels:
          team: blue
    green:
      # Overrides labels and resources, plus set new fields
      templateOverrides:
        labels:
          team: green
        annotations:
          example: "example value"
        resources:
          limits:
            memory: 2Gi
        env:
          - name: MY_ENV
            value: "my_value"
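The inheritance described for the red, blue, and green queues can be modeled as a recursive merge of each queue's templateOverrides into the base pod template. The sketch below assumes that model purely for illustration: `merge_template` is a hypothetical helper, and the chart's actual merge logic (for example, how it treats lists such as env) may differ.

```python
from copy import deepcopy


def merge_template(base: dict, overrides: dict) -> dict:
    """Return a copy of base with overrides applied; nested mappings merge key by key."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_template(merged[key], value)
        else:
            # Scalars, lists, and new keys replace or extend the base as-is (assumption).
            merged[key] = deepcopy(value)
    return merged


base_pod_template = {
    "labels": {"team": "red"},
    "resources": {"limits": {"memory": "1Gi"}},
}

template_overrides = {
    "red": {},
    "blue": {"labels": {"team": "blue"}},
    "green": {
        "labels": {"team": "green"},
        "annotations": {"example": "example value"},
        "resources": {"limits": {"memory": "2Gi"}},
        "env": [{"name": "MY_ENV", "value": "my_value"}],
    },
}

effective = {
    name: merge_template(base_pod_template, overrides)
    for name, overrides in template_overrides.items()
}

for name, template in effective.items():
    print(name, template["labels"]["team"], template["resources"]["limits"]["memory"])
# red red 1Gi
# blue blue 1Gi
# green green 2Gi
```

Under this model, only the green queue ends up with annotations and env entries, since neither appears in the base template.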
Next Steps
Once the agent is up and running, proceed with deploying the ClearML Enterprise App Gateway.