Add Enterprise Server deployment usecases

This commit is contained in:
Noam Wasersprung
2025-05-16 17:10:15 +03:00
committed by GitHub
---
title: ClearML Agent on Kubernetes
---
The ClearML Agent enables scheduling and executing distributed experiments on a Kubernetes cluster.
## Prerequisites
- A running [ClearML Enterprise Server](k8s.md)
- API credentials (`<ACCESS_KEY>` and `<SECRET_KEY>`) generated via
the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
:::note
Make sure these credentials belong to an admin user or a service user with admin privileges.
:::
- The worker environment must be able to access the ClearML Server over the same network.
- A Helm repository token for accessing the `clearml-enterprise` Helm chart repo
## Installation
### Add the Helm Repo Locally
Add the ClearML Helm repository:
```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```
Update the repository locally:
```bash
helm repo update
```
### Create a Values Override File
Create a `clearml-agent-values.override.yaml` file with the following content:
:::note
Replace `<ACCESS_KEY>` and `<SECRET_KEY>` with the API credentials you generated earlier.
Set the `<api|file|web>ServerUrlReference` fields to match your ClearML
Server URLs.
:::
```yaml
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
agentk8sglueKey: "<ACCESS_KEY>"
agentk8sglueSecret: "<SECRET_KEY>"
agentk8sglue:
apiServerUrlReference: "<CLEARML_API_SERVER_REFERENCE>"
fileServerUrlReference: "<CLEARML_FILE_SERVER_REFERENCE>"
webServerUrlReference: "<CLEARML_WEB_SERVER_REFERENCE>"
createQueues: true
queues:
exampleQueue:
templateOverrides: {}
queueSettings: {}
```
### Install the Chart
Install the ClearML Enterprise Agent Helm chart:
```bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
```
## Additional Configuration Options
To view available configuration options for the Helm chart, run the following command:
```bash
helm show readme clearml-enterprise/clearml-enterprise-agent
# or
helm show values clearml-enterprise/clearml-enterprise-agent
```
### Reporting GPU Availability to Orchestration Dashboard
To show GPU availability in the [Orchestration Dashboard](../../webapp/webapp_orchestration_dash.md), explicitly set the number of GPUs:
```yaml
agentk8sglue:
  # -- Maximum number of GPUs the agent reports to the Orchestration Dashboard. This example reports 2 GPUs.
dashboardReportMaxGpu: 2
```
### Queues
The ClearML Agent monitors [ClearML queues](../../fundamentals/agents_and_queues.md) and pulls tasks that are
scheduled for execution.
A single agent can monitor multiple queues. By default, all queues share a base pod template (`agentk8sglue.basePodTemplate`),
which is used to launch a task on Kubernetes after it has been pulled from its queue.
Each queue can override the base pod template with its own settings via its `templateOverrides` template,
so queue definitions can be tailored to different use cases.
The following are a few examples of agent queue templates:
#### Example: GPU Queues
To support GPU queues, you must deploy the NVIDIA GPU Operator on your Kubernetes cluster. For more information, see [GPU Operator](extra_configs/gpu_operator.md).
```yaml
agentk8sglue:
createQueues: true
queues:
1xGPU:
templateOverrides:
resources:
limits:
nvidia.com/gpu: 1
2xGPU:
templateOverrides:
resources:
limits:
nvidia.com/gpu: 2
```
#### Example: Custom Pod Template per Queue
This example demonstrates how to override the base pod template definitions on a per-queue basis.
In this example:
- The `red` queue inherits both the label `team=red` and the 1Gi memory limit from the `basePodTemplate` section.
- The `blue` queue overrides the label by setting it to `team=blue`, and inherits the 1Gi memory from the `basePodTemplate` section.
- The `green` queue overrides the label by setting it to `team=green`, and overrides the memory limit by setting it to 2Gi.
It also sets an annotation and an environment variable.
```yaml
agentk8sglue:
# Defines common template
basePodTemplate:
labels:
team: red
resources:
limits:
memory: 1Gi
createQueues: true
queues:
red:
# Does not override
templateOverrides: {}
blue:
# Overrides labels
templateOverrides:
labels:
team: blue
green:
      # Overrides labels and resources, plus sets new fields
templateOverrides:
labels:
team: green
annotations:
example: "example value"
resources:
limits:
memory: 2Gi
env:
- name: MY_ENV
value: "my_value"
```
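The inheritance rules above behave like a shallow merge of each queue's `templateOverrides` over `basePodTemplate`; a simplified sketch (the chart's actual merge logic may be more granular):

```python
def effective_template(base, overrides):
    # Queue overrides win per top-level key; untouched keys are inherited.
    merged = dict(base)
    merged.update(overrides)
    return merged

base = {"labels": {"team": "red"}, "resources": {"limits": {"memory": "1Gi"}}}

# The "blue" queue overrides only the labels and inherits the resources
blue = effective_template(base, {"labels": {"team": "blue"}})
print(blue)
```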
## Next Steps
Once the agent is up and running, proceed with deploying the [ClearML Enterprise App Gateway](appgw_install_k8s.md).

---
title: Dynamically Edit Task Pod Template
---
ClearML Agent allows you to inject custom Python code to dynamically modify the Kubernetes Pod template before applying it.
## Agent Configuration
The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` environment variable defines the Python module and function inside that
module to be invoked by the agent before applying a task pod template.
The agent will run this code in its own context, pass arguments (including the actual template) to the function, and use
the returned template to create the final Task Pod in Kubernetes.
Arguments passed to the function include:
* `queue` (string) - ID of the queue from which the task was pulled.
* `queue_name` (string) - Name of the queue from which the task was pulled.
* `template` (Python dictionary) - Base Pod template created from the agent's configuration and any queue-specific overrides.
* `task_data` (object) - Task data object (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID.
* `providers_info` (dictionary) - Provider information optionally collected for the user running this task
when they logged into the system (requires additional server configuration).
* `task_config` (`clearml_agent.backend_config.Config` object) - Task configuration containing configuration vaults applicable
for the user running this task, and other configuration. Use `task_config.get("...")` to get specific configuration values.
* `worker` - The agent Python object in case custom calls might be required.
### Usage
Update `clearml-agent-values.override.yaml` to include:
```yaml
agentk8sglue:
extraEnvs:
- name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
value: "custom_code:update_template"
fileMounts:
- name: "custom_code.py"
folderPath: "/root"
fileContent: |-
import json
from pprint import pformat
def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
print(pformat(template))
my_var_name = "foo"
my_var_value = "bar"
try:
template["spec"]["containers"][0]["env"][0]["name"] = str(my_var_name)
template["spec"]["containers"][0]["env"][0]["value"] = str(my_var_value)
except KeyError as ex:
raise Exception("Failed modifying template: {}".format(ex))
return {"template": template}
```
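The function's contract (receive the pod template, return it under a `"template"` key) can be exercised locally with a dummy template; a minimal sketch, assuming placeholder values for the arguments the agent normally supplies:

```python
def update_template(queue, task_data, providers_info, template, task_config,
                    worker, queue_name, *args, **kwargs):
    # Mirror the file mount above: rewrite the first env var of the first container
    template["spec"]["containers"][0]["env"][0]["name"] = "foo"
    template["spec"]["containers"][0]["env"][0]["value"] = "bar"
    return {"template": template}

dummy = {"spec": {"containers": [{"env": [{"name": "PLACEHOLDER", "value": ""}]}]}}
result = update_template(queue="queue-id", task_data=None, providers_info={},
                         template=dummy, task_config=None, worker=None,
                         queue_name="default")
print(result["template"]["spec"]["containers"][0]["env"][0])
```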
:::note notes
* Always include `*args, **kwargs` at the end of the function's argument list and only use keyword arguments.
This is needed to maintain backward compatibility.
* Custom code modules can be included as a file in the pod's container, and the environment variable can be used to
point to the file and entry point.
* When defining a custom code module, by default the agent watches pods in all namespaces
across the cluster. If you do not intend to grant a `ClusterRole` permission, set the
`CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES` env var to `"0"` to prevent the agent from trying to list pods in all namespaces.
Set it to `"1"` only if your custom code needs to make namespace-related changes.
```yaml
agentk8sglue:
extraEnvs:
- name: CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES
value: "0"
```
To customize the bash startup scripts instead of the pod spec, use:
```yaml
agentk8sglue:
  # -- Custom Bash script for the Agent pod, run by the Glue Agent
customBashScript: ""
  # -- Custom Bash script for the Task Pods, run by the Glue Agent
containerCustomBashScript: ""
```
## Examples
### Example: Edit Template Based on ENV Var
```yaml
agentk8sglue:
extraEnvs:
- name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
value: "custom_code:update_template"
fileMounts:
- name: "custom_code.py"
folderPath: "/root"
fileContent: |-
import json
from pprint import pformat
def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
print(pformat(template))
my_var = "some_var"
try:
template["spec"]["initContainers"][0]["command"][-1] = \
template["spec"]["initContainers"][0]["command"][-1].replace("MY_VAR", str(my_var))
template["spec"]["containers"][0]["volumeMounts"][0]["subPath"] = str(my_var)
except KeyError as ex:
raise Exception("Failed modifying template with MY_VAR: {}".format(ex))
return {"template": template}
basePodTemplate:
initContainers:
- name: myInitContainer
image: docker/ubuntu:18.04
command:
- /bin/bash
- -c
- >
echo MY_VAR;
volumeMounts:
- name: myTemplatedMount
mountPath: MY_VAR
volumes:
- name: myTemplatedMount
emptyDir: {}
```
### Example: Inject NFS Mount Path
```yaml
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
      value: "custom_code:update_template"
  fileMounts:
    - name: "custom_code.py"
      folderPath: "/root"
      fileContent: |-
        import json
        from pprint import pformat

        def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
            nfs = task_config.get("nfs")
            # ad_role = providers_info.get("ad-role")
            if nfs:  # and ad_role == "some-value"
                print(pformat(template))
                try:
                    template["spec"]["containers"][0]["volumeMounts"].append(
                        {"name": "custom-mount", "mountPath": nfs.get("mountPath")}
                    )
                    # Pod volumes belong at the spec level, alongside the containers
                    template["spec"]["volumes"].append(
                        {"name": "custom-mount", "nfs": {"server": nfs.get("server.ip"), "path": nfs.get("server.path")}}
                    )
                except KeyError as ex:
                    raise Exception("Failed modifying template: {}".format(ex))
            return {"template": template}
```
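The dotted-key lookups above (`task_config.get("nfs")`, `nfs.get("server.ip")`) can be simulated with a small stand-in config object; a sketch (the real `task_config` is a `clearml_agent` Config object, and pod `volumes` are appended at the spec level):

```python
class DummyConfig:
    """Minimal stand-in for the agent's task_config with dotted-path lookups."""
    def __init__(self, data):
        self._data = data
    def get(self, path, default=None):
        node = self._data
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                return default
            node = node[key]
        return DummyConfig(node) if isinstance(node, dict) else node

task_config = DummyConfig({
    "nfs": {"mountPath": "/mnt/data", "server": {"ip": "10.0.0.5", "path": "/exports"}}
})
template = {"spec": {"containers": [{"volumeMounts": []}], "volumes": []}}

nfs = task_config.get("nfs")
if nfs:
    template["spec"]["containers"][0]["volumeMounts"].append(
        {"name": "custom-mount", "mountPath": nfs.get("mountPath")}
    )
    template["spec"]["volumes"].append(
        {"name": "custom-mount",
         "nfs": {"server": nfs.get("server.ip"), "path": nfs.get("server.path")}}
    )
print(template["spec"]["volumes"][0])
```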
### Example: Bind PVC Resource to Task Pod
In this example, a PVC is created and attached to every pod spawned from a dedicated queue, and deleted after the pod is removed.
Key points:
- `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` and `CLEARML_K8S_GLUE_POD_POST_DELETE_CMD` env vars let you define custom bash
code hooks to be executed around the main apply command by the Agent, such as creating and deleting a PVC object.
- `CLEARML_K8S_GLUE_TEMPLATE_MODULE` env var and a file mount let you define custom Python code in a specific context,
useful to dynamically update the main Pod template before the Agent applies it.
:::note notes
* This example uses a queue named `pvc-test`; make sure to replace all its occurrences if you use a different queue name.
* `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` can reference templated variables such as `{queue_name}`, `{pod_name}`, and
`{namespace}`, which the Agent replaces with actual values at execution time.
```yaml
agentk8sglue:
# Bind a pre-defined custom 'custom-agent-role' Role with the ability to handle 'persistentvolumeclaims'
additionalRoleBindings:
- custom-agent-role
extraEnvs:
    # Required unless the agent is granted permissions to list pods in all namespaces
- name: CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES
value: "0"
# Executed before applying the Task Pod. Replace the $PVC_NAME placeholder in the manifest template with the Pod name and apply it, only in a specific queue.
- name: CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD
value: "[[ {queue_name} == 'pvc-test' ]] && sed 's/\\$PVC_NAME/{pod_name}/g' /mnt/yaml-manifests/pvc.yaml | kubectl apply -n {namespace} -f - || echo 'Skipping PRE_APPLY PVC creation...'"
# Executed after deleting the Task Pod. Delete the PVC.
- name: CLEARML_K8S_GLUE_POD_POST_DELETE_CMD
value: "kubectl delete pvc {pod_name} -n {namespace} || echo 'Skipping POST_DELETE PVC deletion...'"
# Define a custom code module for updating the Pod template
- name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
value: "custom_code:update_template"
fileMounts:
# Mount a PVC manifest file with a templated $PVC_NAME name
- name: "pvc.yaml"
folderPath: "/mnt/yaml-manifests"
fileContent: |
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: $PVC_NAME
spec:
resources:
requests:
storage: 5Gi
volumeMode: Filesystem
storageClassName: standard
accessModes:
- ReadWriteOnce
# Custom code module for updating the Pod template
- name: "custom_code.py"
folderPath: "/root"
fileContent: |-
import json
from pprint import pformat
def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
if queue_name == "pvc-test":
# Set PVC_NAME as the name of the Pod
PVC_NAME = f"clearml-id-{task_data.id}"
try:
# Replace the claimName placeholder with a dynamic value
template["spec"]["volumes"][0]["persistentVolumeClaim"]["claimName"] = str(PVC_NAME)
except KeyError as ex:
raise Exception("Failed modifying template with PVC_NAME: {}".format(ex))
# Return the edited template
return {"template": template}
createQueues: true
queues:
# Define a queue with an override `volumes` and `volumeMounts` section for binding a PVC
pvc-test:
templateOverrides:
volumes:
- name: task-pvc
persistentVolumeClaim:
# PVC_NAME placeholder. This will get replaced in the custom code module.
claimName: PVC_NAME
volumeMounts:
- mountPath: "/tmp/task/"
name: task-pvc
```
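The `sed` substitution performed by the PRE_APPLY hook can be checked in isolation, using a hypothetical pod name in place of the `{pod_name}` placeholder:

```bash
# Write a manifest fragment with the $PVC_NAME placeholder (quoted heredoc keeps it literal)
cat > /tmp/pvc-test.yaml <<'EOF'
metadata:
  name: $PVC_NAME
EOF

# Same substitution the hook performs before piping the manifest to kubectl apply
sed 's/\$PVC_NAME/clearml-id-abc123/g' /tmp/pvc-test.yaml
```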
### Example: Required Role
The following is an example of `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: custom-agent-role
rules:
- apiGroups:
- ""
resources:
- persistentvolumeclaims
verbs:
- get
- list
- watch
- create
- patch
- delete
```

---
title: Basic Deployment - Suggested GPU Operator Values
---
This guide provides recommended configuration values for deploying the NVIDIA GPU Operator alongside ClearML Enterprise.
## Add the Helm Repo Locally
Add the NVIDIA GPU Operator Helm repository:
```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
```
Update the repository locally:
```bash
helm repo update
```
## Installation
To prevent unprivileged containers from bypassing the Kubernetes Device Plugin API, configure the GPU operator
using the following `gpu-operator.override.yaml` file:
```yaml
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
```
Install the `gpu-operator`:
``` bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```
## Fractional GPU Support
To enable fractional GPU allocation or manage mixed GPU configurations, refer to the following guides:
* [ClearML Dynamic MIG Operator](../fractional_gpus/cdmo.md) (CDMO) - Dynamically configures MIG GPUs on supported devices.
* [ClearML Enterprise Fractional GPU Injector](../fractional_gpus/cfgi.md) (CFGI) - Enables fractional (non-MIG) GPU
allocation for better hardware utilization and workload distribution in Kubernetes.
* [CDMO and CFGI on the same Cluster](../fractional_gpus/cdmo_cfgi_same_cluster.md) - In clusters with multiple nodes and
varying GPU types, the `gpu-operator` can be used to manage different device configurations and fractioning modes.

---
title: Multi-Node Training
---
The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single Task to run across multiple pods
on different nodes.
Below is a configuration example using `clearml-agent-values.override.yaml`:
```yaml
agentk8sglue:
# Cluster access is required to run multi-node Tasks
serviceAccountClusterAccess: true
multiNode:
enabled: true
createQueues: true
queues:
multi-node-example:
queueSettings:
        # Defines the distribution of GPU Tasks across multiple nodes. The format [x, y, ...]
        # requests 'x' GPUs on one node and 'y' GPUs on another. Pods are spawned accordingly,
        # sized at the lowest common denominator of the values defined.
multiNode: [ 4, 2 ]
templateOverrides:
resources:
limits:
          # Note: use the lowest common denominator of the GPU distribution defined in `queueSettings.multiNode`.
nvidia.com/gpu: 2
```
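The `[4, 2]` distribution above resolves to pods sized at the lowest common denominator; an illustrative calculation (the agent's exact spawning behavior may differ):

```python
from functools import reduce
from math import gcd

def pod_layout(multi_node):
    # Pods are sized at the lowest common denominator of the requested splits;
    # the total GPU count determines how many pods are spawned.
    per_pod = reduce(gcd, multi_node)
    return per_pod, sum(multi_node) // per_pod

gpus_per_pod, num_pods = pod_layout([4, 2])
print(gpus_per_pod, num_pods)
```

For `multiNode: [4, 2]` this yields pods of 2 GPUs, matching the `nvidia.com/gpu: 2` limit in the override.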

---
title: ClearML Presign Service
---
The ClearML Presign Service is a secure service that generates and redirects pre-signed storage URLs for authenticated
users, enabling direct access to cloud-hosted data (e.g., S3) without exposing credentials.
## Prerequisites
- A ClearML Enterprise Server is up and running.
- API credentials (`<ACCESS_KEY>` and `<SECRET_KEY>`) generated via
the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
:::note
Make sure these credentials belong to an admin user or a service user with admin privileges.
:::
- The worker environment must be able to access the ClearML Server over the same network.
- A Helm repository token for accessing the `clearml-enterprise` Helm chart repo
## Installation
### Add the Helm Repo Locally
Add the ClearML Helm repository:
```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```
Update the repository locally:
```bash
helm repo update
```
### Prepare Configuration
Create a `presign-service.override.yaml` file (make sure to replace the placeholders):
```yaml
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
apiServerUrlReference: "<CLEARML_API_SERVER_URL>"
apiKey: "<ACCESS_KEY>"
apiSecret: "<SECRET_KEY>"
ingress:
enabled: true
hostName: "<PRESIGN_SERVICE_URL>"
```
### Deploy the Helm Chart
Install the `clearml-presign-service` Helm chart in the same namespace as the ClearML Enterprise server:
```bash
helm install -n clearml clearml-presign-service clearml-enterprise/clearml-presign-service -f presign-service.override.yaml
```
### Update ClearML Enterprise Server Configuration
Enable the ClearML Server to use the Presign Service by editing your `clearml-values.override.yaml` file.
Add the following to the `apiserver.extraEnvs` section (make sure to replace `<PRESIGN_SERVICE_URL>`):
```yaml
apiserver:
extraEnvs:
- name: CLEARML__SERVICES__SYSTEM__COMPANY__DEFAULT__SERVICES
value: "[{\"type\":\"presign\",\"url\":\"https://<PRESIGN_SERVICE_URL>\",\"use_fallback\":\"false\",\"match_sets\":[{\"rules\":[{\"field\":\"\",\"obj_type\":\"\",\"regex\":\"^s3://\"}]}]}]"
```
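The escaped JSON in the value above is easy to get wrong by hand; it can be generated programmatically instead (a sketch reproducing the same fields):

```python
import json

# Same structure as the escaped value above; keep the placeholder URL as-is
services = [{
    "type": "presign",
    "url": "https://<PRESIGN_SERVICE_URL>",
    "use_fallback": "false",
    "match_sets": [{"rules": [{"field": "", "obj_type": "", "regex": "^s3://"}]}],
}]
value = json.dumps(services)
print(value)
```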
Apply the changes with a Helm upgrade.

---
title: ClearML Tenant with Self Signed Certificates
---
This guide covers how to configure the [AI Application Gateway](#ai-application-gateway) and [ClearML Agent](#clearml-agent)
to use self-signed or custom SSL certificates.
## AI Application Gateway
To configure certificates for the Application Gateway, update your `clearml-app-gateway-values.override.yaml` file:
```yaml
# -- Custom certificates
customCertificates:
  # -- Override the system crt certificate bundle. Mutually exclusive with extraCerts.
overrideCaCertificatesCrt:
  # -- Extra certs, useful when additional certificates must be added to the standard bundle. Requires root permissions to run update-ca-certificates. Mutually exclusive with overrideCaCertificatesCrt.
extraCerts:
- alias: certificateName
pem: |
-----BEGIN CERTIFICATE-----
###
-----END CERTIFICATE-----
```
You have two configuration options:
- [**Replace**](#replace-entire-ca-certificatescrt-file) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt) extra certificates to the existing `ca-certificates.crt`
### Replace Entire `ca-certificates.crt` File
To replace the whole CA bundle, provide a concatenated list of all trusted CA certificates in PEM format,
as they would appear in a standard `ca-certificates.crt`.
```yaml
# -- Custom certificates
customCertificates:
  # -- Override the system crt certificate bundle. Mutually exclusive with extraCerts.
overrideCaCertificatesCrt: |
-----BEGIN CERTIFICATE-----
### CERT 1
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
### CERT 2
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
### CERT 3
-----END CERTIFICATE-----
...
```
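A quick sanity check that a concatenated bundle contains the expected number of certificates (a sketch that simply counts PEM blocks):

```python
def count_pem_certs(bundle: str) -> int:
    # Each certificate in the bundle contributes one BEGIN/END pair
    return bundle.count("-----BEGIN CERTIFICATE-----")

bundle = """-----BEGIN CERTIFICATE-----
### CERT 1
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
### CERT 2
-----END CERTIFICATE-----
"""
print(count_pem_certs(bundle))
```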
### Append Extra Certificates to the Existing `ca-certificates.crt`
You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.
```yaml
# -- Custom certificates
customCertificates:
  # -- Extra certs, useful when additional certificates must be added to the standard bundle. Requires root permissions to run update-ca-certificates. Mutually exclusive with overrideCaCertificatesCrt.
extraCerts:
- alias: certificate-name-1
pem: |
-----BEGIN CERTIFICATE-----
###
-----END CERTIFICATE-----
- alias: certificate-name-2
pem: |
-----BEGIN CERTIFICATE-----
###
-----END CERTIFICATE-----
```
### Apply Changes
To apply the changes, run the update command:
```bash
helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
```
## ClearML Agent
For the ClearML Agent, configure certificates in the `clearml-agent-values.override.yaml` file:
```yaml
# -- Custom certificates
customCertificates:
  # -- Override the system crt certificate bundle. Mutually exclusive with extraCerts.
overrideCaCertificatesCrt:
  # -- Extra certs, useful when additional certificates must be added to the standard bundle. Requires root permissions to run update-ca-certificates. Mutually exclusive with overrideCaCertificatesCrt.
extraCerts:
- alias: certificateName
pem: |
-----BEGIN CERTIFICATE-----
###
-----END CERTIFICATE-----
```
You have two configuration options:
- [**Replace**](#replace-entire-ca-certificatescrt-file-1) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt-1) extra certificates to the existing `ca-certificates.crt`
### Replace Entire `ca-certificates.crt` File
To replace the whole CA bundle, provide a concatenated list of all trusted CA certificates in PEM format,
as they would appear in a standard `ca-certificates.crt`.
```yaml
# -- Custom certificates
customCertificates:
  # -- Override the system crt certificate bundle. Mutually exclusive with extraCerts.
overrideCaCertificatesCrt: |
-----BEGIN CERTIFICATE-----
### CERT 1
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
### CERT 2
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
### CERT 3
-----END CERTIFICATE-----
...
```
### Append Extra Certificates to the Existing `ca-certificates.crt`
You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.
```yaml
# -- Custom certificates
customCertificates:
  # -- Extra certs, useful when additional certificates must be added to the standard bundle. Requires root permissions to run update-ca-certificates. Mutually exclusive with overrideCaCertificatesCrt.
extraCerts:
- alias: certificate-name-1
pem: |
-----BEGIN CERTIFICATE-----
###
-----END CERTIFICATE-----
- alias: certificate-name-2
pem: |
-----BEGIN CERTIFICATE-----
###
-----END CERTIFICATE-----
```
### Add Certificates to Task Pods
If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into pods:
```yaml
agentk8sglue:
  basePodTemplate:
    initContainers:
      - name: init-task
        image: allegroai/clearml-enterprise-agent-k8s-base:<AGENT-VERSION-AVAILABLE-ON-REPO>
        imagePullPolicy: IfNotPresent
        command:
          - /bin/sh
          - -c
          - update-ca-certificates
        volumeMounts:
          - name: etc-ssl-certs
            mountPath: "/etc/ssl/certs"
          - name: clearml-extra-ca-certs
            mountPath: "/usr/local/share/ca-certificates"
    env:
      - name: REQUESTS_CA_BUNDLE
        value: /etc/ssl/certs/ca-certificates.crt
    volumeMounts:
      - name: etc-ssl-certs
        mountPath: "/etc/ssl/certs"
    volumes:
      - name: etc-ssl-certs
        emptyDir: {}
      - name: clearml-extra-ca-certs
        projected:
          defaultMode: 420
          sources:
            # List here the ConfigMaps created by the agent chart;
            # the number of entries depends on how many certificates were provided.
            - configMap:
                name: clearml-agent-clearml-enterprise-agent-custom-ca-0
            - configMap:
                name: clearml-agent-clearml-enterprise-agent-custom-ca-1
```
The `clearml-extra-ca-certs` volume must include all `ConfigMap` resources generated by the agent for the custom certificates.
These `ConfigMaps` are automatically created by the Helm chart based on the number of certificates provided.
Their names are usually prefixed with the Helm release name, so adjust accordingly if you used a custom release name.
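Assuming the naming pattern shown above (`<release-name>-clearml-enterprise-agent-custom-ca-<index>`), the expected ConfigMap names can be enumerated for a given release name and certificate count:

```python
def ca_configmap_names(release_name: str, num_certs: int):
    # One ConfigMap per provided certificate, indexed from 0
    return [f"{release_name}-clearml-enterprise-agent-custom-ca-{i}"
            for i in range(num_certs)]

print(ca_configmap_names("clearml-agent", 2))
```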
### Apply Changes
Apply the changes by running the update command:
``` bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
```

---
title: SSO (Identity Provider) Setup
---
ClearML Enterprise Server supports various Single Sign-On (SSO) identity providers.
SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and is applied to the
`apiserver` component.
The following are configuration examples for commonly used providers. Supported providers include:
* Auth0
* Keycloak
* Okta
* Azure AD
* Google
* AWS Cognito
## Auth0
```yaml
apiserver:
extraEnvs:
- name: CLEARML__secure__login__sso__oauth_client__auth0__client_id
value: "<AUTH0_CLIENT_ID>"
- name: CLEARML__secure__login__sso__oauth_client__auth0__client_secret
value: "<AUTH0_CLIENT_SECRET>"
- name: CLEARML__services__login__sso__oauth_client__auth0__base_url
value: "<AUTH0_BASE_URL>"
- name: CLEARML__services__login__sso__oauth_client__auth0__authorize_url
value: "<AUTH0_AUTHORIZE_URL>"
- name: CLEARML__services__login__sso__oauth_client__auth0__access_token_url
value: "<AUTH0_ACCESS_TOKEN_URL>"
- name: CLEARML__services__login__sso__oauth_client__auth0__audience
value: "<AUTH0_AUDIENCE>"
```
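The variable names map configuration paths onto environment variables using a `CLEARML__` prefix and `__` separators (a pattern inferred from the examples in this section):

```python
def clearml_env_name(config_path: str) -> str:
    # e.g. "services.login.sso.oauth_client.auth0.base_url"
    return "CLEARML__" + config_path.replace(".", "__")

print(clearml_env_name("secure.login.sso.oauth_client.auth0.client_id"))
```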
## Keycloak
```yaml
apiserver:
extraEnvs:
- name: CLEARML__secure__login__sso__oauth_client__keycloak__client_id
value: "<KC_CLIENT_ID>"
- name: CLEARML__secure__login__sso__oauth_client__keycloak__client_secret
value: "<KC_SECRET_ID>"
- name: CLEARML__services__login__sso__oauth_client__keycloak__base_url
value: "<KC_URL>/realms/<REALM_NAME>/"
- name: CLEARML__services__login__sso__oauth_client__keycloak__authorize_url
value: "<KC_URL>/realms/<REALM_NAME>/protocol/openid-connect/auth"
- name: CLEARML__services__login__sso__oauth_client__keycloak__access_token_url
value: "<KC_URL>/realms/<REALM_NAME>/protocol/openid-connect/token"
- name: CLEARML__services__login__sso__oauth_client__keycloak__idp_logout
value: "true"
```
## Group Membership Mapping in Keycloak
To map Keycloak groups into the ClearML user's SSO token:
1. Go to the **Client Scopes** tab.
1. Click on the `<clearml client>-dedicated` scope.
1. Click **Add Mapper > By Configuration > Group Membership**
1. Configure the mapper:
    * Set the **Name** to "groups"
    * Set the **Token Claim Name** to "groups"
    * Uncheck **Full group path**
    * Save the mapper.
To verify:
1. Go to the **Client Details > Client scope** tab.
1. Go to the **Evaluate** sub-tab and select a user with any group memberships.
1. Go to **Generated ID Token** and then to **Generated User Info**.
1. Verify that the `groups` claim appears in the displayed user data in both cases.

---
title: ClearML Dynamic MIG Operator (CDMO)
---
The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations.
## Installation
### Requirements
* Add and update the Nvidia Helm repo:
```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```
* Create a `gpu-operator.override.yaml` file with the following content:
```yaml
migManager:
enabled: false
mig:
strategy: mixed
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
```
* Install the NVIDIA `gpu-operator` using Helm with the previous configuration:
```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```
### Installing CDMO
1. Create a `cdmo-values.override.yaml` file with the following content:
```yaml
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
```
1. Install the CDMO Helm Chart using the previous override file:
```bash
helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
```
1. Enable NVIDIA MIG support on your cluster by running the following command on all nodes with a MIG-supported GPU
(run it once for each GPU `<GPU_ID>` on the host):
```bash
nvidia-smi -i <GPU_ID> -mig 1
```
:::note notes
* A node reboot may be required if the command output indicates so.
* For convenience, this command can be run from within the `nvidia-device-plugin-daemonset` pod running on the related node.
:::
1. Label all MIG-enabled GPU nodes `<NODE_NAME>` from the previous step:
```bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
```
## Disabling MIGs
To disable MIG mode and restore standard full-GPU access:
1. Ensure no running workflows are using GPUs on the target node(s).
2. Remove the CDMO label from the target node(s) to disable the dynamic MIG reconfiguration.
```bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
```
3. Execute a shell into the `device-plugin-daemonset` pod instance running on the target node(s) and execute the following commands:
```bash
nvidia-smi mig -dci
nvidia-smi mig -dgi
nvidia-smi -mig 0
```
4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access, and upgrade the `gpu-operator`:
```yaml
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
```

---
title: Install CDMO and CFGI on the Same Cluster
---
You can install both CDMO (ClearML Dynamic MIG Operator) and CFGI (ClearML Fractional GPU Injector) on a shared Kubernetes cluster.
In clusters with multiple nodes and varying GPU types, the `gpu-operator` can be used to manage different device configurations
and fractioning modes.
## Configuring the NVIDIA GPU Operator
The NVIDIA `gpu-operator` supports defining multiple configurations for the Device Plugin.
The following example YAML defines three Device Plugin configurations: `all-disabled` (the default), `ts` (time-slicing), and `mig`.
```yaml
migManager:
enabled: false
mig:
strategy: mixed
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
enabled: true
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
config:
name: device-plugin-config
create: true
default: "all-disabled"
data:
all-disabled: |-
version: v1
flags:
migStrategy: none
ts: |-
version: v1
flags:
migStrategy: none
sharing:
timeSlicing:
renameByDefault: false
failRequestsGreaterThanOne: false
          # Edit the following configuration as needed, adding an entry for each GPU index installed on the host.
resources:
- name: nvidia.com/gpu
rename: nvidia.com/gpu-0
devices:
- "0"
replicas: 8
mig: |-
version: v1
flags:
migStrategy: mixed
```
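This override is applied when installing (or upgrading) the NVIDIA `gpu-operator`. For example, using the same repository, release name, and namespace as elsewhere in this guide:

```shell
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```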
## Applying Configuration to Nodes
Label each Kubernetes node accordingly to activate a specific GPU mode:
|Mode| Label command|
|----|-----|
| `mig` | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=mig` |
| `ts` (time slicing) | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts` |
| Standard full-GPU access | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled` |
After a node is labeled, the NVIDIA `device-plugin` will automatically reload the new configuration.
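To verify which configuration a node picked up, inspect its label and the GPU resources it advertises (the jsonpath bracket syntax escapes the dots in the label key; exact resource names depend on the active mode):

```shell
# Show the active device-plugin configuration label for the node
kubectl get node <NODE_NAME> -o jsonpath="{.metadata.labels['nvidia\.com/device-plugin\.config']}"

# List advertised GPU resources (e.g. nvidia.com/gpu, nvidia.com/gpu-0, or MIG profiles)
kubectl describe node <NODE_NAME> | grep -A 10 'Allocatable:'
```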
## Installing CDMO and CFGI
After configuring the NVIDIA `gpu-operator` and labeling nodes, proceed with the standard installations of [CDMO](cdmo.md)
and [CFGI](cfgi.md).
## Disabling Configurations
### Time Slicing
To disable time-slicing and use full GPU access, update the node label using the `--overwrite` flag:
```bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
```
### MIG
To disable MIG mode:
1. Ensure no running workloads are requesting GPUs on the target node(s).
2. Remove the CDMO label from the target node(s).
```bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
```
3. Open a shell in the `device-plugin-daemonset` pod running on the target node(s) and run the following commands:
```bash
nvidia-smi mig -dci
nvidia-smi mig -dgi
nvidia-smi -mig 0
```
4. Label the node to use standard (non-MIG) GPU mode:
```bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
```
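Step 3 above can also be performed without opening an interactive shell, via `kubectl exec`. A sketch, assuming the GPU Operator's default `gpu-operator` namespace and its standard `app=nvidia-device-plugin-daemonset` pod label:

```shell
# Locate the device-plugin pod scheduled on the target node
POD=$(kubectl get pods -n gpu-operator \
  -l app=nvidia-device-plugin-daemonset \
  --field-selector spec.nodeName=<NODE_NAME> \
  -o jsonpath='{.items[0].metadata.name}')

# Destroy compute instances, then GPU instances, then disable MIG mode
kubectl exec -n gpu-operator "$POD" -- nvidia-smi mig -dci
kubectl exec -n gpu-operator "$POD" -- nvidia-smi mig -dgi
kubectl exec -n gpu-operator "$POD" -- nvidia-smi -mig 0
```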

---
title: ClearML Fractional GPU Injector (CFGI)
---
The **ClearML Enterprise Fractional GPU Injector** (CFGI) allows AI workloads to utilize fractional (non-MIG) GPU slices
on Kubernetes clusters, maximizing hardware efficiency and performance.
## Installation
### Add the Local ClearML Helm Repository
```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <GITHUB_TOKEN> --password <GITHUB_TOKEN>
helm repo update
```
### Requirements
* Install the NVIDIA `gpu-operator` using Helm
* Set the number of GPU slices to 8
* Add and update the Nvidia Helm repo:
```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```
* Credentials for the ClearML Enterprise DockerHub repository
### GPU Operator Configuration
#### For CFGI Version >= 1.3.0
1. Create a Docker Registry secret named `clearml-dockerhub-access` in the `gpu-operator` namespace. Make sure to replace `<CLEARML_DOCKERHUB_TOKEN>` with your token.
```bash
kubectl create secret -n gpu-operator docker-registry clearml-dockerhub-access \
--docker-server=docker.io \
--docker-username=allegroaienterprise \
--docker-password="<CLEARML_DOCKERHUB_TOKEN>" \
--docker-email=""
```
1. Create a `gpu-operator.override.yaml` file as follows:
* Set `devicePlugin.repository` to `docker.io/clearml`
* Configure `devicePlugin.config.data.renamed-resources.sharing.timeSlicing.resources` for each GPU index on the host
* Use `nvidia.com/gpu-<INDEX>` format for the `rename` field, and set `replicas` to `8`.
```yaml
gfd:
imagePullSecrets:
- "clearml-dockerhub-access"
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
repository: docker.io/clearml
image: k8s-device-plugin
version: v0.17.1-gpu-card-selection
imagePullPolicy: Always
imagePullSecrets:
- "clearml-dockerhub-access"
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
config:
name: device-plugin-config
create: true
default: "renamed-resources"
data:
renamed-resources: |-
version: v1
flags:
migStrategy: none
sharing:
timeSlicing:
renameByDefault: false
failRequestsGreaterThanOne: false
            # Edit the following configuration as needed, adding an entry for each GPU index installed on the host.
resources:
- name: nvidia.com/gpu
rename: nvidia.com/gpu-0
devices:
- "0"
replicas: 8
- name: nvidia.com/gpu
rename: nvidia.com/gpu-1
devices:
- "1"
replicas: 8
```
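Once the nodes come up with this configuration, each card should be advertised as a separate renamed resource with 8 replicas. This can be checked per node:

```shell
# Expect entries such as "nvidia.com/gpu-0":"8" and "nvidia.com/gpu-1":"8"
kubectl get node <NODE_NAME> -o jsonpath='{.status.allocatable}'
```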
#### For CFGI version < 1.3.0 (Legacy)
Create a `gpu-operator.override.yaml` file:
```yaml
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
config:
name: device-plugin-config
create: true
default: "any"
data:
any: |-
version: v1
flags:
migStrategy: none
sharing:
timeSlicing:
renameByDefault: false
failRequestsGreaterThanOne: false
resources:
- name: nvidia.com/gpu
replicas: 8
```
### Install GPU Operator and CFGI
1. Install the NVIDIA `gpu-operator` using the previously created `gpu-operator.override.yaml` file:
```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```
1. Create a `cfgi-values.override.yaml` file with the following content:
```yaml
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
```
1. Install the CFGI Helm Chart using the previous override file:
```bash
helm install -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector --create-namespace -f cfgi-values.override.yaml
```
## Usage
To use fractional GPUs, label your pod with:
```yaml
labels:
clearml-injector/fraction: "<GPU_FRACTION_VALUE>"
```
Valid values for `"<GPU_FRACTION_VALUE>"` include:
* Fractions:
* "0.0625" (1/16th)
* "0.125" (1/8th)
* "0.250"
* "0.375"
* "0.500"
* "0.625"
* "0.750"
* "0.875"
* Whole-GPU values, such as `1.000`, `2`, or `2.0`
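For example, a minimal standalone Pod requesting half a GPU could look like this (the pod name and container image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fractional-gpu-example    # illustrative name
  labels:
    clearml-injector/fraction: "0.500"
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: nvcr.io/nvidia/pytorch:24.01-py3    # illustrative image
      command: ["nvidia-smi"]
```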
### ClearML Agent Configuration
To run ClearML jobs with fractional GPU allocation, configure the queues accordingly in your `clearml-agent-values.override.yaml` file.
Each queue should include a `templateOverride` that sets the `clearml-injector/fraction` label, which determines the
fraction of a GPU to allocate (e.g., "0.500" for half a GPU).
This label is used by the CFGI to assign the correct portion of GPU resources to the pod running the task.
#### CFGI Version >= 1.3.0
Starting from version 1.3.0, there is no need to specify the `resources` field; only the `clearml-injector/fraction` labels need to be set:
```yaml
agentk8sglue:
createQueues: true
queues:
gpu-fraction-1_000:
templateOverrides:
labels:
clearml-injector/fraction: "1.000"
gpu-fraction-0_500:
templateOverrides:
labels:
clearml-injector/fraction: "0.500"
gpu-fraction-0_250:
templateOverrides:
labels:
clearml-injector/fraction: "0.250"
gpu-fraction-0_125:
templateOverrides:
labels:
clearml-injector/fraction: "0.125"
```
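After editing `clearml-agent-values.override.yaml`, apply the change by upgrading the agent release. The release and chart names below are assumptions based on the agent installation chapter; adjust them to your deployment:

```shell
helm upgrade -n <AGENT_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent -f clearml-agent-values.override.yaml
```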
#### CFGI Version < 1.3.0
For versions older than 1.3.0, the GPU limits must be defined:
```yaml
agentk8sglue:
createQueues: true
queues:
gpu-fraction-1_000:
templateOverrides:
resources:
limits:
nvidia.com/gpu: 8
gpu-fraction-0_500:
templateOverrides:
labels:
clearml-injector/fraction: "0.500"
resources:
limits:
nvidia.com/gpu: 4
gpu-fraction-0_250:
templateOverrides:
labels:
clearml-injector/fraction: "0.250"
resources:
limits:
nvidia.com/gpu: 2
gpu-fraction-0_125:
templateOverrides:
labels:
clearml-injector/fraction: "0.125"
resources:
limits:
nvidia.com/gpu: 1
```
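The `nvidia.com/gpu` limits in the table above follow directly from the 8-way time slicing configured earlier: each physical GPU is split into 8 slices, so the limit is the requested fraction multiplied by 8. A small sketch of that mapping:

```python
def legacy_gpu_limit(fraction: float, slices_per_gpu: int = 8) -> int:
    """Number of time-sliced nvidia.com/gpu replicas for a given GPU fraction.

    Mirrors the pre-1.3.0 queue examples above (e.g. 0.5 of a GPU -> 4 slices).
    """
    limit = fraction * slices_per_gpu
    if limit != int(limit):
        raise ValueError(f"{fraction} is not representable with {slices_per_gpu} slices")
    return int(limit)

print(legacy_gpu_limit(1.0))    # 8
print(legacy_gpu_limit(0.5))    # 4
print(legacy_gpu_limit(0.25))   # 2
print(legacy_gpu_limit(0.125))  # 1
```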
## Upgrading CFGI Chart
To upgrade to the latest chart version:
```bash
helm repo update
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector
```
To apply new values to an existing installation:
```bash
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector -f cfgi-values.override.yaml
```
## Disabling Fractions
To revert to standard GPU scheduling (without time slicing), remove the `devicePlugin.config` section from the `gpu-operator.override.yaml`
file and upgrade the `gpu-operator`:
```yaml
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
```
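Then apply the updated override to the existing release (release name and namespace as used during installation):

```shell
helm upgrade -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml
```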