5.0 KiB
title |
---|
ClearML Agent on Kubernetes |
The ClearML Agent enables scheduling and executing distributed experiments on a Kubernetes cluster.
Prerequisites
-
A running ClearML Enterprise Server
-
API credentials (
<ACCESS_KEY>
and<SECRET_KEY>
) generated via the ClearML UI (Settings > Workspace > API Credentials > Create new credentials). For more information, see ClearML API Credentials.:::note Make sure these credentials belong to an admin user or a service user with admin privileges. :::
-
The worker environment must be able to access the ClearML Server over the same network.
-
Helm token to access
clearml-enterprise
Helm chart repo
Installation
Add the Helm Repo Locally
Add the ClearML Helm repository:
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
Update the repository locally:
helm repo update
Create a Values Override File
Create a clearml-agent-values.override.yaml
file with the following content:
:::note
Replace the <ACCESS_KEY>
and <SECRET_KEY>
with the API credentials you generated earlier.
Set the <api|file|web>ServerUrlReference
fields to match your ClearML
Server URLs.
:::
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
agentk8sglueKey: "<ACCESS_KEY>"
agentk8sglueSecret: "<SECRET_KEY>"
agentk8sglue:
apiServerUrlReference: "<CLEARML_API_SERVER_REFERENCE>"
fileServerUrlReference: "<CLEARML_FILE_SERVER_REFERENCE>"
webServerUrlReference: "<CLEARML_WEB_SERVER_REFERENCE>"
createQueues: true
queues:
exampleQueue:
templateOverrides: {}
queueSettings: {}
Install the Chart
Install the ClearML Enterprise Agent Helm chart:
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
Additional Configuration Options
To view available configuration options for the Helm chart, run the following command:
helm show readme clearml-enterprise/clearml-enterprise-agent
# or
helm show values clearml-enterprise/clearml-enterprise-agent
Reporting GPU Availability to Orchestration Dashboard
To show GPU availability in the Orchestration Dashboard, explicitly set the number of GPUs:
agentk8sglue:
# -- Agent reporting to Dashboard max GPU available. This will report 2 GPUs.
dashboardReportMaxGpu: 2
Queues
The ClearML Agent monitors ClearML queues and pulls tasks that are scheduled for execution.
A single agent can monitor multiple queues. By default, all queues share a base pod template (agentk8sglue.basePodTemplate
)
used when launching tasks on Kubernetes after it has been pulled from the queue.
Each queue can be configured to override the base pod template with its own settings with a templateOverrides
queue template.
This way queue definitions can be tailored to different use cases.
The following are a few examples of agent queue templates:
Example: GPU Queues
To support GPU queues, you must deploy the NVIDIA GPU Operator on your Kubernetes cluster. For more information, see GPU Operator.
agentk8sglue:
createQueues: true
queues:
1xGPU:
templateOverrides:
resources:
limits:
nvidia.com/gpu: 1
2xGPU:
templateOverrides:
resources:
limits:
nvidia.com/gpu: 2
Example: Custom Pod Template per Queue
This example demonstrates how to override the base pod template definitions on a per-queue basis. In this example:
- The
red
queue inherits both the labelteam=red
and the 1Gi memory limit from thebasePodTemplate
section. - The
blue
queue overrides the label by setting it toteam=blue
, and inherits the 1Gi memory from thebasePodTemplate
section. - The
green
queue overrides the label by setting it toteam=green
, and overrides the memory limit by setting it to 2Gi. It also sets an annotation and an environment variable.
agentk8sglue:
# Defines common template
basePodTemplate:
labels:
team: red
resources:
limits:
memory: 1Gi
createQueues: true
queues:
red:
# Does not override
templateOverrides: {}
blue:
# Overrides labels
templateOverrides:
labels:
team: blue
green:
# Overrides labels and resources, plus set new fields
templateOverrides:
labels:
team: green
annotations:
example: "example value"
resources:
limits:
memory: 2Gi
env:
- name: MY_ENV
value: "my_value"
Next Steps
Once the agent is up and running, proceed with deploying the ClearML Enterprise App Gateway.