clearml-docs/docs/deploying_clearml/enterprise_deploy/agent_k8s.md
2025-05-20 13:33:57 +03:00

170 lines
5.0 KiB
Markdown

---
title: ClearML Agent on Kubernetes
---
The ClearML Agent enables scheduling and executing distributed experiments on a Kubernetes cluster.
## Prerequisites
- A running [ClearML Enterprise Server](k8s.md)
- API credentials (`<ACCESS_KEY>` and `<SECRET_KEY>`) generated via
the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
:::note
Make sure these credentials belong to an admin user or a service account with admin privileges.
:::
- The worker environment must be able to access the ClearML Server over the same network.
- Helm token to access `clearml-enterprise` Helm chart repo
## Installation
### Add the Helm Repo Locally
Add the ClearML Helm repository:
```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```
Update the local repository:
```bash
helm repo update
```
### Create a Values Override File
Create a `clearml-agent-values.override.yaml` file with the following content:
:::note
Replace the `<ACCESS_KEY>` and `<SECRET_KEY>`with the API credentials you generated earlier.
Set the `<api|file|web>ServerUrlReference` fields to match your ClearML
Server URLs.
:::
```yaml
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
agentk8sglueKey: "<ACCESS_KEY>"
agentk8sglueSecret: "<SECRET_KEY>"
agentk8sglue:
apiServerUrlReference: "<CLEARML_API_SERVER_REFERENCE>"
fileServerUrlReference: "<CLEARML_FILE_SERVER_REFERENCE>"
webServerUrlReference: "<CLEARML_WEB_SERVER_REFERENCE>"
createQueues: true
queues:
exampleQueue:
templateOverrides: {}
queueSettings: {}
```
### Install the Chart
Install the ClearML Enterprise Agent Helm chart:
```bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
```
## Additional Configuration Options
To view available configuration options for the Helm chart, run the following command:
```bash
helm show readme clearml-enterprise/clearml-enterprise-agent
# or
helm show values clearml-enterprise/clearml-enterprise-agent
```
### Reporting GPU Availability to Orchestration Dashboard
To show GPU availability in the [Orchestration Dashboard](../../webapp/webapp_orchestration_dash.md), explicitly set the number of GPUs:
```yaml
agentk8sglue:
# -- Agent reporting to Dashboard max GPU available. This will report 2 GPUs.
dashboardReportMaxGpu: 2
```
### Queues
The ClearML Agent monitors [ClearML queues](../../fundamentals/agents_and_queues.md) and pulls tasks that are
scheduled for execution.
A single agent can monitor multiple queues. By default, all queues share a base pod template (`agentk8sglue.basePodTemplate`)
used when launching tasks on Kubernetes after it has been pulled from the queue.
Each queue can be configured to override the base pod template with its own settings with a `templateOverrides` queue template.
This way queue definitions can be tailored to different use cases.
The following are a few examples of agent queue templates:
#### Example: GPU Queues
To support GPU queues, you must deploy the NVIDIA GPU Operator on your Kubernetes cluster. For more information, see [GPU Operator](extra_configs/gpu_operator.md).
```yaml
agentk8sglue:
createQueues: true
queues:
1xGPU:
templateOverrides:
resources:
limits:
nvidia.com/gpu: 1
2xGPU:
templateOverrides:
resources:
limits:
nvidia.com/gpu: 2
```
#### Example: Custom Pod Template per Queue
This example demonstrates how to override the base pod template definitions on a per-queue basis.
In this example:
- The `red` queue inherits both the label `team=red` and the 1Gi memory limit from the `basePodTemplate` section.
- The `blue` queue overrides the label by setting it to `team=blue`, and inherits the 1Gi memory from the `basePodTemplate` section.
- The `green` queue overrides the label by setting it to `team=green`, and overrides the memory limit by setting it to 2Gi.
It also sets an annotation and an environment variable.
```yaml
agentk8sglue:
# Defines common template
basePodTemplate:
labels:
team: red
resources:
limits:
memory: 1Gi
createQueues: true
queues:
red:
# Does not override
templateOverrides: {}
blue:
# Overrides labels
templateOverrides:
labels:
team: blue
green:
# Overrides labels and resources, plus set new fields
templateOverrides:
labels:
team: green
annotations:
example: "example value"
resources:
limits:
memory: 2Gi
env:
- name: MY_ENV
value: "my_value"
```
## Next Steps
Once the agent is up and running, proceed with deploying the [ClearML Enterprise App Gateway](appgw_install_k8s.md).