mirror of
https://github.com/clearml/clearml-docs
synced 2025-06-26 18:17:44 +00:00
---
title: ClearML Agent on Kubernetes
---

The ClearML Agent enables scheduling and executing distributed experiments on a Kubernetes cluster.

## Prerequisites

- A running [ClearML Enterprise Server](k8s.md)
- API credentials (`<ACCESS_KEY>` and `<SECRET_KEY>`) generated via
  the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).

  :::note
  Make sure these credentials belong to an admin user or a service user with admin privileges.
  :::

- The worker environment must be able to access the ClearML Server over the same network.
- A token to access the `clearml-enterprise` Helm chart repository

## Installation

### Add the Helm Repo Locally

Add the ClearML Helm repository:

```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```

Update the repository locally:

```bash
helm repo update
```

### Create a Values Override File

Create a `clearml-agent-values.override.yaml` file with the following content:

:::note
Replace `<ACCESS_KEY>` and `<SECRET_KEY>` with the API credentials you generated earlier.
Set the `<api|file|web>ServerUrlReference` fields to match your ClearML Server URLs.
:::

```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
  agentk8sglueKey: "<ACCESS_KEY>"
  agentk8sglueSecret: "<SECRET_KEY>"
agentk8sglue:
  apiServerUrlReference: "<CLEARML_API_SERVER_REFERENCE>"
  fileServerUrlReference: "<CLEARML_FILE_SERVER_REFERENCE>"
  webServerUrlReference: "<CLEARML_WEB_SERVER_REFERENCE>"
  createQueues: true
  queues:
    exampleQueue:
      templateOverrides: {}
      queueSettings: {}
```

### Install the Chart

Install the ClearML Enterprise Agent Helm chart:

```bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
```

## Additional Configuration Options

To view the available configuration options for the Helm chart, run one of the following commands:

```bash
helm show readme clearml-enterprise/clearml-enterprise-agent
# or
helm show values clearml-enterprise/clearml-enterprise-agent
```

### Reporting GPU Availability to Orchestration Dashboard

To show GPU availability in the [Orchestration Dashboard](../../webapp/webapp_orchestration_dash.md), explicitly set the number of GPUs:

```yaml
agentk8sglue:
  # -- Maximum number of available GPUs the agent reports to the Dashboard. This will report 2 GPUs.
  dashboardReportMaxGpu: 2
```

### Queues

The ClearML Agent monitors [ClearML queues](../../fundamentals/agents_and_queues.md) and pulls tasks that are
scheduled for execution.

A single agent can monitor multiple queues. By default, all queues share a base pod template (`agentk8sglue.basePodTemplate`),
which is used when launching a task on Kubernetes after it has been pulled from its queue.

Each queue can override the base pod template with its own settings using a `templateOverrides` queue template.
This way, queue definitions can be tailored to different use cases.

The following are a few examples of agent queue templates:

#### Example: GPU Queues

To support GPU queues, you must deploy the NVIDIA GPU Operator on your Kubernetes cluster. For more information, see [GPU Operator](extra_configs/gpu_operator.md).

```yaml
agentk8sglue:
  createQueues: true
  queues:
    1xGPU:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
    2xGPU:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2
```

#### Example: Custom Pod Template per Queue

This example demonstrates how to override the base pod template definitions on a per-queue basis.
In this example:

- The `red` queue inherits both the label `team=red` and the 1Gi memory limit from the `basePodTemplate` section.
- The `blue` queue overrides the label by setting it to `team=blue`, and inherits the 1Gi memory limit from the `basePodTemplate` section.
- The `green` queue overrides the label by setting it to `team=green`, and overrides the memory limit by setting it to 2Gi.
  It also sets an annotation and an environment variable.

```yaml
agentk8sglue:
  # Defines common template
  basePodTemplate:
    labels:
      team: red
    resources:
      limits:
        memory: 1Gi
  createQueues: true
  queues:
    red:
      # Does not override
      templateOverrides: {}
    blue:
      # Overrides labels
      templateOverrides:
        labels:
          team: blue
    green:
      # Overrides labels and resources, and sets new fields
      templateOverrides:
        labels:
          team: green
        annotations:
          example: "example value"
        resources:
          limits:
            memory: 2Gi
        env:
          - name: MY_ENV
            value: "my_value"
```

## Next Steps

Once the agent is up and running, proceed with deploying the [ClearML Enterprise App Gateway](appgw_install_k8s.md).
---
title: Dynamically Edit Task Pod Template
---

ClearML Agent allows you to inject custom Python code to dynamically modify the Kubernetes Pod template before applying it.

## Agent Configuration

The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` environment variable defines the Python module, and the function inside that
module, to be invoked by the agent before applying a task pod template.

The agent will run this code in its own context, pass arguments (including the actual template) to the function, and use
the returned template to create the final Task Pod in Kubernetes.

Arguments passed to the function include:

* `queue` (string) - ID of the queue from which the task was pulled.
* `queue_name` (string) - Name of the queue from which the task was pulled.
* `template` (Python dictionary) - Base Pod template created from the agent's configuration and any queue-specific overrides.
* `task_data` (object) - Task data object (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID.
* `providers_info` (dictionary) - Provider info containing optional information collected for the user running this task
  when the user logged into the system (requires additional server configuration).
* `task_config` (`clearml_agent.backend_config.Config` object) - Task configuration containing configuration vaults applicable
  to the user running this task, and other configuration. Use `task_config.get("...")` to get specific configuration values.
* `worker` - The agent's Python object, in case custom calls are required.

### Usage

Update `clearml-agent-values.override.yaml` to include:

```yaml
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
      value: "custom_code:update_template"
  fileMounts:
    - name: "custom_code.py"
      folderPath: "/root"
      fileContent: |-
        import json
        from pprint import pformat

        def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
            print(pformat(template))

            my_var_name = "foo"
            my_var_value = "bar"

            try:
                template["spec"]["containers"][0]["env"][0]["name"] = str(my_var_name)
                template["spec"]["containers"][0]["env"][0]["value"] = str(my_var_value)
            except KeyError as ex:
                raise Exception("Failed modifying template: {}".format(ex))

            return {"template": template}
```

:::note notes
* Always include `*args, **kwargs` at the end of the function's argument list, and only use keyword arguments.
  This is needed to maintain backward compatibility.

* Custom code modules can be included as a file in the pod's container, and the environment variable can be used to
  point to the file and entry point.

* When defining a custom code module, by default the agent will start watching pods in all namespaces
  across the cluster. If you do not intend to grant a `ClusterRole` permission, make sure to set the
  `CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES` env var to `"0"` to prevent the agent from trying to list pods in all namespaces.
  Set it to `"1"` only if namespace-related changes are needed in the code.

  ```yaml
  agentk8sglue:
    extraEnvs:
      - name: CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES
        value: "0"
  ```
:::

To customize the bash startup scripts instead of the pod spec, use:

```yaml
agentk8sglue:
  # -- Custom Bash script for the Agent pod run by the Glue Agent
  customBashScript: ""
  # -- Custom Bash script for the Task Pods run by the Glue Agent
  containerCustomBashScript: ""
```

## Examples

### Example: Edit Template Based on ENV Var

```yaml
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
      value: "custom_code:update_template"
  fileMounts:
    - name: "custom_code.py"
      folderPath: "/root"
      fileContent: |-
        import json
        from pprint import pformat

        def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
            print(pformat(template))

            my_var = "some_var"

            try:
                template["spec"]["initContainers"][0]["command"][-1] = \
                    template["spec"]["initContainers"][0]["command"][-1].replace("MY_VAR", str(my_var))
                template["spec"]["containers"][0]["volumeMounts"][0]["subPath"] = str(my_var)
            except KeyError as ex:
                raise Exception("Failed modifying template with MY_VAR: {}".format(ex))

            return {"template": template}
  basePodTemplate:
    initContainers:
      - name: my-init-container
        image: docker/ubuntu:18.04
        command:
          - /bin/bash
          - -c
          - >
            echo MY_VAR;
    volumeMounts:
      - name: my-templated-mount
        mountPath: MY_VAR
    volumes:
      - name: my-templated-mount
        emptyDir: {}
```

### Example: Inject NFS Mount Path

```yaml
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
      value: "custom_code:update_template"
  fileMounts:
    - name: "custom_code.py"
      folderPath: "/root"
      fileContent: |-
        import json
        from pprint import pformat

        def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
            nfs = task_config.get("nfs")
            # ad_role = providers_info.get("ad-role")
            if nfs:  # and ad_role == "some-value"
                print(pformat(template))

                try:
                    template["spec"]["containers"][0]["volumeMounts"].append(
                        {"name": "custom-mount", "mountPath": nfs.get("mountPath")}
                    )
                    # Volumes are defined at the Pod spec level
                    template["spec"]["volumes"].append(
                        {"name": "custom-mount", "nfs": {"server": nfs.get("server.ip"), "path": nfs.get("server.path")}}
                    )
                except KeyError as ex:
                    raise Exception("Failed modifying template: {}".format(ex))

            return {"template": template}
```

### Example: Bind PVC Resource to Task Pod

In this example, a PVC is created and attached to every pod created from a dedicated queue, and deleted afterwards.

Key points:

- The `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` and `CLEARML_K8S_GLUE_POD_POST_DELETE_CMD` env vars let you define custom bash
  code hooks to be executed around the main apply command by the agent, such as creating and deleting a PVC object.

- The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` env var and a file mount let you define custom Python code in a specific context,
  useful to dynamically update the main Pod template before the agent applies it.

:::note notes
* This example uses a queue named `pvc-test`; make sure to replace all occurrences of it.

* `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` can reference templated variables such as `{queue_name}`, `{pod_name}`, and
  `{namespace}`, which are replaced with the actual values by the agent at execution time.
:::

```yaml
agentk8sglue:
  # Bind a pre-defined custom 'custom-agent-role' Role with the ability to handle 'persistentvolumeclaims'
  additionalRoleBindings:
    - custom-agent-role
  extraEnvs:
    # Required unless the agent has permissions to list pods in all namespaces
    - name: CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES
      value: "0"
    # Executed before applying the Task Pod. Replaces the $PVC_NAME placeholder in the manifest template with the Pod name and applies it, only for a specific queue.
    - name: CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD
      value: "[[ {queue_name} == 'pvc-test' ]] && sed 's/\\$PVC_NAME/{pod_name}/g' /mnt/yaml-manifests/pvc.yaml | kubectl apply -n {namespace} -f - || echo 'Skipping PRE_APPLY PVC creation...'"
    # Executed after deleting the Task Pod. Deletes the PVC.
    - name: CLEARML_K8S_GLUE_POD_POST_DELETE_CMD
      value: "kubectl delete pvc {pod_name} -n {namespace} || echo 'Skipping POST_DELETE PVC deletion...'"
    # Define a custom code module for updating the Pod template
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
      value: "custom_code:update_template"
  fileMounts:
    # Mount a PVC manifest file with a templated $PVC_NAME name
    - name: "pvc.yaml"
      folderPath: "/mnt/yaml-manifests"
      fileContent: |
        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: $PVC_NAME
        spec:
          resources:
            requests:
              storage: 5Gi
          volumeMode: Filesystem
          storageClassName: standard
          accessModes:
            - ReadWriteOnce
    # Custom code module for updating the Pod template
    - name: "custom_code.py"
      folderPath: "/root"
      fileContent: |-
        import json
        from pprint import pformat

        def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
            if queue_name == "pvc-test":
                # Set PVC_NAME to the name of the Pod
                PVC_NAME = f"clearml-id-{task_data.id}"
                try:
                    # Replace the claimName placeholder with a dynamic value
                    template["spec"]["volumes"][0]["persistentVolumeClaim"]["claimName"] = str(PVC_NAME)
                except KeyError as ex:
                    raise Exception("Failed modifying template with PVC_NAME: {}".format(ex))
            # Return the edited template
            return {"template": template}
  createQueues: true
  queues:
    # Define a queue with an override `volumes` and `volumeMounts` section for binding a PVC
    pvc-test:
      templateOverrides:
        volumes:
          - name: task-pvc
            persistentVolumeClaim:
              # PVC_NAME placeholder. This will get replaced in the custom code module.
              claimName: PVC_NAME
        volumeMounts:
          - mountPath: "/tmp/task/"
            name: task-pvc
```

### Example: Required Role

The following is an example of a `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: custom-agent-role
rules:
  - apiGroups:
      - ""
    resources:
      - persistentvolumeclaims
    verbs:
      - get
      - list
      - watch
      - create
      - patch
      - delete
```
---
title: Basic Deployment - Suggested GPU Operator Values
---

This guide provides recommended configuration values for deploying the NVIDIA GPU Operator alongside ClearML Enterprise.

## Add the Helm Repo Locally

Add the NVIDIA GPU Operator Helm repository:

```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
```

Update the repository locally:

```bash
helm repo update
```

## Installation

To prevent unprivileged containers from bypassing the Kubernetes Device Plugin API, configure the GPU Operator
using the following `gpu-operator.override.yaml` file:

```yaml
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY # Use volume-mounts
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
```

Install the `gpu-operator`:

```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```

## Fractional GPU Support

To enable fractional GPU allocation or manage mixed GPU configurations, refer to the following guides:

* [ClearML Dynamic MIG Operator](../fractional_gpus/cdmo.md) (CDMO) – Dynamically configures MIG GPUs on supported devices.
* [ClearML Enterprise Fractional GPU Injector](../fractional_gpus/cfgi.md) (CFGI) – Enables fractional (non-MIG) GPU
  allocation for better hardware utilization and workload distribution in Kubernetes.
* [CDMO and CFGI on the same Cluster](../fractional_gpus/cdmo_cfgi_same_cluster.md) – In clusters with multiple nodes and
  varying GPU types, the `gpu-operator` can be used to manage different device configurations and fractioning modes.
---
title: Multi-Node Training
---

The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single Task to run across multiple pods
on different nodes.

Below is a configuration example using `clearml-agent-values.override.yaml`:

```yaml
agentk8sglue:
  # Cluster access is required to run multi-node Tasks
  serviceAccountClusterAccess: true
  multiNode:
    enabled: true
  createQueues: true
  queues:
    multi-node-example:
      queueSettings:
        # Defines the distribution of GPU Tasks across multiple nodes. The format [x, y, ...]
        # specifies the distribution of Tasks as 'x' GPUs on one node and 'y' GPUs on another node.
        # Multiple Pods will be spawned based on the lowest common denominator defined.
        multiNode: [ 4, 2 ]
      templateOverrides:
        resources:
          limits:
            # Note: use the lowest common denominator of the GPU distribution defined in `queueSettings.multiNode`.
            nvidia.com/gpu: 2
```
---
title: ClearML Presign Service
---

The ClearML Presign Service is a secure service that generates and redirects pre-signed storage URLs for authenticated
users, enabling direct access to cloud-hosted data (e.g., S3) without exposing credentials.

## Prerequisites

- A running ClearML Enterprise Server
- API credentials (`<ACCESS_KEY>` and `<SECRET_KEY>`) generated via
  the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).

  :::note
  Make sure these credentials belong to an admin user or a service user with admin privileges.
  :::

- The worker environment must be able to access the ClearML Server over the same network.
- A token to access the `clearml-enterprise` Helm chart repository

## Installation

### Add the Helm Repo Locally

Add the ClearML Helm repository:

```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```

Update the repository locally:

```bash
helm repo update
```

### Prepare Configuration

Create a `presign-service.override.yaml` file (make sure to replace the placeholders):

```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
  apiServerUrlReference: "<CLEARML_API_SERVER_URL>"
  apiKey: "<ACCESS_KEY>"
  apiSecret: "<SECRET_KEY>"
ingress:
  enabled: true
  hostName: "<PRESIGN_SERVICE_URL>"
```

### Deploy the Helm Chart

Install the `clearml-presign-service` Helm chart in the same namespace as the ClearML Enterprise server:

```bash
helm install -n clearml clearml-presign-service clearml-enterprise/clearml-presign-service -f presign-service.override.yaml
```

### Update ClearML Enterprise Server Configuration

Enable the ClearML Server to use the Presign Service by editing your `clearml-values.override.yaml` file.
Add the following to the `apiserver.extraEnvs` section (make sure to replace `<PRESIGN_SERVICE_URL>`):

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__SERVICES__SYSTEM__COMPANY__DEFAULT__SERVICES
      value: "[{\"type\":\"presign\",\"url\":\"https://<PRESIGN_SERVICE_URL>\",\"use_fallback\":\"false\",\"match_sets\":[{\"rules\":[{\"field\":\"\",\"obj_type\":\"\",\"regex\":\"^s3://\"}]}]}]"
```

Apply the changes with a Helm upgrade.

---
title: ClearML Tenant with Self-Signed Certificates
---

This guide covers how to configure the [AI Application Gateway](#ai-application-gateway) and [ClearML Agent](#clearml-agent)
to use self-signed or custom SSL certificates.

## AI Application Gateway

To configure certificates for the Application Gateway, update your `clearml-app-gateway-values.override.yaml` file:

```yaml
# -- Custom certificates
customCertificates:
  # -- Override the system crt certificate bundle. Mutually exclusive with extraCerts.
  overrideCaCertificatesCrt:
  # -- Extra certificates to add to the standard bundle. Requires root permissions to run update-ca-certificates. Mutually exclusive with overrideCaCertificatesCrt.
  extraCerts:
    - alias: certificateName
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
```

You have two configuration options:

- [**Replace**](#replace-entire-ca-certificatescrt-file) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt) extra certificates to the existing `ca-certificates.crt`

### Replace Entire `ca-certificates.crt` File

To replace the whole CA bundle, provide a concatenated list of all trusted CA certificates in `pem` format, as
they are stored in a standard `ca-certificates.crt`:

```yaml
# -- Custom certificates
customCertificates:
  # -- Override the system crt certificate bundle. Mutually exclusive with extraCerts.
  overrideCaCertificatesCrt: |
    -----BEGIN CERTIFICATE-----
    ### CERT 1
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    ### CERT 2
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    ### CERT 3
    -----END CERTIFICATE-----
    ...
```

### Append Extra Certificates to the Existing `ca-certificates.crt`

You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.

```yaml
# -- Custom certificates
customCertificates:
  # -- Extra certificates to add to the standard bundle. Requires root permissions to run update-ca-certificates. Mutually exclusive with overrideCaCertificatesCrt.
  extraCerts:
    - alias: certificate-name-1
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
    - alias: certificate-name-2
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
```

### Apply Changes

To apply the changes, run the update command:

```bash
helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
```

## ClearML Agent

For the ClearML Agent, configure certificates in the `clearml-agent-values.override.yaml` file:

```yaml
# -- Custom certificates
customCertificates:
  # -- Override the system crt certificate bundle. Mutually exclusive with extraCerts.
  overrideCaCertificatesCrt:
  # -- Extra certificates to add to the standard bundle. Requires root permissions to run update-ca-certificates. Mutually exclusive with overrideCaCertificatesCrt.
  extraCerts:
    - alias: certificateName
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
```

You have two configuration options:

- [**Replace**](#replace-entire-ca-certificatescrt-file-1) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt-1) extra certificates to the existing `ca-certificates.crt`

### Replace Entire `ca-certificates.crt` File

To replace the whole CA bundle, provide a concatenated list of all trusted CA certificates in `pem` format, as
they are stored in a standard `ca-certificates.crt`:

```yaml
# -- Custom certificates
customCertificates:
  # -- Override the system crt certificate bundle. Mutually exclusive with extraCerts.
  overrideCaCertificatesCrt: |
    -----BEGIN CERTIFICATE-----
    ### CERT 1
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    ### CERT 2
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    ### CERT 3
    -----END CERTIFICATE-----
    ...
```

### Append Extra Certificates to the Existing `ca-certificates.crt`

You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.

```yaml
# -- Custom certificates
customCertificates:
  # -- Extra certificates to add to the standard bundle. Requires root permissions to run update-ca-certificates. Mutually exclusive with overrideCaCertificatesCrt.
  extraCerts:
    - alias: certificate-name-1
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
    - alias: certificate-name-2
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
```

### Add Certificates to Task Pods

If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into pods:

```yaml
agentk8sglue:
  basePodTemplate:
    initContainers:
      - command:
          - /bin/sh
          - -c
          - update-ca-certificates
        image: allegroai/clearml-enterprise-agent-k8s-base:<AGENT-VERSION-AVAILABLE-ON-REPO>
        imagePullPolicy: IfNotPresent
        name: init-task
        volumeMounts:
          - name: etc-ssl-certs
            mountPath: "/etc/ssl/certs"
          - name: clearml-extra-ca-certs
            mountPath: "/usr/local/share/ca-certificates"
    env:
      - name: REQUESTS_CA_BUNDLE
        value: /etc/ssl/certs/ca-certificates.crt
    volumeMounts:
      - name: etc-ssl-certs
        mountPath: "/etc/ssl/certs"
    volumes:
      - name: etc-ssl-certs
        emptyDir: {}
      - name: clearml-extra-ca-certs
        projected:
          defaultMode: 420
          sources:
            # List here the ConfigMaps created by the agent chart; their number depends on how many certificates were provided.
            - configMap:
                name: clearml-agent-clearml-enterprise-agent-custom-ca-0
            - configMap:
                name: clearml-agent-clearml-enterprise-agent-custom-ca-1
```

The `clearml-extra-ca-certs` volume must include all `ConfigMap` resources generated by the agent for the custom certificates.
These `ConfigMaps` are automatically created by the Helm chart based on the number of certificates provided.
Their names are usually prefixed with the Helm release name, so adjust accordingly if you used a custom release name.

### Apply Changes

Apply the changes by running the update command:

```bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
```
---
|
||||
title: SSO (Identity Provider) Setup
|
||||
---
|
||||
|
||||
ClearML Enterprise Server supports various Single Sign-On (SSO) identity providers.
|
||||
SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and is applied to the
|
||||
`apiserver` component.
|
||||
|
||||
The following are configuration examples for commonly used providers. Other supported systems include:
|
||||
* Auth0
|
||||
* Keycloak
|
||||
* Okta
|
||||
* Azure AD
|
||||
* Google
|
||||
* AWS Cognito
|
||||
|
||||
## Auth0

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__secure__login__sso__oauth_client__auth0__client_id
      value: "<AUTH0_CLIENT_ID>"
    - name: CLEARML__secure__login__sso__oauth_client__auth0__client_secret
      value: "<AUTH0_CLIENT_SECRET>"
    - name: CLEARML__services__login__sso__oauth_client__auth0__base_url
      value: "<AUTH0_BASE_URL>"
    - name: CLEARML__services__login__sso__oauth_client__auth0__authorize_url
      value: "<AUTH0_AUTHORIZE_URL>"
    - name: CLEARML__services__login__sso__oauth_client__auth0__access_token_url
      value: "<AUTH0_ACCESS_TOKEN_URL>"
    - name: CLEARML__services__login__sso__oauth_client__auth0__audience
      value: "<AUTH0_AUDIENCE>"
```
## Keycloak

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__secure__login__sso__oauth_client__keycloak__client_id
      value: "<KC_CLIENT_ID>"
    - name: CLEARML__secure__login__sso__oauth_client__keycloak__client_secret
      value: "<KC_SECRET_ID>"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__base_url
      value: "<KC_URL>/realms/<REALM_NAME>/"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__authorize_url
      value: "<KC_URL>/realms/<REALM_NAME>/protocol/openid-connect/auth"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__access_token_url
      value: "<KC_URL>/realms/<REALM_NAME>/protocol/openid-connect/token"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__idp_logout
      value: "true"
```
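
The variable names above follow ClearML's environment-variable override convention: take the configuration path, join its segments with double underscores, and prefix the result with `CLEARML`. A minimal sketch of that mapping (the helper function is ours, for illustration only):

```python
def clearml_env_var(config_path: str) -> str:
    """Map a dotted ClearML configuration path to the override
    environment-variable form used above: path segments joined by
    double underscores, prefixed with CLEARML."""
    return "CLEARML__" + config_path.replace(".", "__")

# The Keycloak client_id setting shown above:
print(clearml_env_var("secure.login.sso.oauth_client.keycloak.client_id"))
# CLEARML__secure__login__sso__oauth_client__keycloak__client_id
```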

## Group Membership Mapping in Keycloak

To map Keycloak groups into the ClearML user's SSO token:

1. Go to the **Client Scopes** tab.
1. Click the `<clearml client>-dedicated` scope.
1. Click **Add Mapper > By Configuration > Group Membership**.
1. Configure the mapper:
   * Set **Name** to "groups".
   * Set **Token Claim Name** to "groups".
   * Uncheck **Full group path**.
   * Save the mapper.

To verify:

1. Go to the **Client Details > Client scope** tab.
1. Go to the **Evaluate** sub-tab and select a user with group memberships.
1. Check **Generated ID Token** and **Generated User Info**.
1. Verify that the groups claim appears in the displayed user data in both cases.
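
Outside of Keycloak's **Evaluate** tab, you can also inspect a claim by decoding an ID token's payload directly: a JWT is three base64url segments separated by dots, and the middle one holds the claims. A short sketch (the token below is fabricated for illustration, and no signature verification is performed):

```python
import base64
import json

def jwt_claims(token: str) -> dict:
    """Decode the payload (second segment) of a JWT without verifying
    the signature -- enough to check for a 'groups' claim."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))

# Fabricated token: placeholder header/signature, payload carrying a groups claim.
payload = base64.urlsafe_b64encode(
    json.dumps({"sub": "jane", "groups": ["ml-team", "admins"]}).encode()
).rstrip(b"=").decode()
token = f"eyJhbGciOiJSUzI1NiJ9.{payload}.sig"

print(jwt_claims(token)["groups"])  # ['ml-team', 'admins']
```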
---
title: ClearML Dynamic MIG Operator (CDMO)
---

The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations.

## Installation

### Requirements

* Add and update the NVIDIA Helm repo:

  ```bash
  helm repo add nvidia https://nvidia.github.io/gpu-operator
  helm repo update
  ```
* Create a `gpu-operator.override.yaml` file with the following content:

  ```yaml
  migManager:
    enabled: false
  mig:
    strategy: mixed
  toolkit:
    env:
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
        value: "false"
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
        value: "true"
  devicePlugin:
    env:
      - name: PASS_DEVICE_SPECS
        value: "true"
      - name: FAIL_ON_INIT_ERROR
        value: "true"
      - name: DEVICE_LIST_STRATEGY # Use volume-mounts
        value: volume-mounts
      - name: DEVICE_ID_STRATEGY
        value: uuid
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
  ```
* Install the NVIDIA `gpu-operator` using Helm with the above configuration:

  ```bash
  helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
  ```
### Installing CDMO

1. Create a `cdmo-values.override.yaml` file with the following content:

   ```yaml
   imageCredentials:
     password: "<CLEARML_DOCKERHUB_TOKEN>"
   ```

1. Install the CDMO Helm chart using the override file:

   ```bash
   helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
   ```

1. Enable NVIDIA MIG support on your cluster by running the following command on every node with a MIG-capable GPU, once for each GPU `<GPU_ID>` on the host:

   ```bash
   nvidia-smi -i <GPU_ID> -mig 1
   ```

   :::note notes
   * A node reboot may be required if the command output indicates so.
   * For convenience, this command can be run from within the `nvidia-device-plugin-daemonset` pod running on the related node.
   :::

1. Label each MIG-enabled GPU node `<NODE_NAME>` from the previous step:

   ```bash
   kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
   ```
## Disabling MIGs

To disable MIG mode and restore standard full-GPU access:

1. Ensure no running workloads are using GPUs on the target node(s).

2. Remove the CDMO label from the target node(s) to disable dynamic MIG reconfiguration:

   ```bash
   kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
   ```

3. Open a shell into the `device-plugin-daemonset` pod instance running on the target node(s) and execute the following commands:

   ```bash
   nvidia-smi mig -dci

   nvidia-smi mig -dgi

   nvidia-smi -mig 0
   ```

4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access, and upgrade the `gpu-operator`:

   ```yaml
   toolkit:
     env:
       - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
         value: "false"
       - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
         value: "true"
   devicePlugin:
     env:
       - name: PASS_DEVICE_SPECS
         value: "true"
       - name: FAIL_ON_INIT_ERROR
         value: "true"
       - name: DEVICE_LIST_STRATEGY # Use volume-mounts
         value: volume-mounts
       - name: DEVICE_ID_STRATEGY
         value: uuid
       - name: NVIDIA_VISIBLE_DEVICES
         value: all
       - name: NVIDIA_DRIVER_CAPABILITIES
         value: all
   ```
---
title: Install CDMO and CFGI on the Same Cluster
---

You can install both CDMO (ClearML Dynamic MIG Operator) and CFGI (ClearML Fractional GPU Injector) on a shared Kubernetes cluster. In clusters with multiple nodes and varying GPU types, the `gpu-operator` can be used to manage different device configurations and fractioning modes.

## Configuring the NVIDIA GPU Operator

The NVIDIA `gpu-operator` supports defining multiple configurations for the Device Plugin. The following example YAML defines two configurations, "mig" and "ts" (time-slicing):
```yaml
migManager:
  enabled: false
mig:
  strategy: mixed
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  enabled: true
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
  config:
    name: device-plugin-config
    create: true
    default: "all-disabled"
    data:
      all-disabled: |-
        version: v1
        flags:
          migStrategy: none
      ts: |-
        version: v1
        flags:
          migStrategy: none
        sharing:
          timeSlicing:
            renameByDefault: false
            failRequestsGreaterThanOne: false
            # Edit the following configuration as needed, adding one entry per GPU index installed on the host.
            resources:
              - name: nvidia.com/gpu
                rename: nvidia.com/gpu-0
                devices:
                  - "0"
                replicas: 8
      mig: |-
        version: v1
        flags:
          migStrategy: mixed
```
## Applying Configuration to Nodes

Label each Kubernetes node to activate a specific GPU mode:

| Mode | Label command |
|------|---------------|
| `mig` | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=mig` |
| `ts` (time-slicing) | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts` |
| Standard full-GPU access | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled` |

After a node is labeled, the NVIDIA `device-plugin` automatically reloads the new configuration.
## Installing CDMO and CFGI

After configuring the NVIDIA `gpu-operator` and labeling nodes, proceed with the standard installations of [CDMO](cdmo.md) and [CFGI](cfgi.md).

## Disabling Configurations

### Time Slicing

To disable time-slicing and restore full GPU access, update the node label using the `--overwrite` flag:

```bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
```

### MIG

To disable MIG mode:

1. Ensure there are no running workloads requesting any form of GPU on the node(s).

2. Remove the CDMO label from the target node(s):

   ```bash
   kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
   ```

3. Open a shell in the `device-plugin-daemonset` pod instance running on the target node(s) and execute the following commands:

   ```bash
   nvidia-smi mig -dci

   nvidia-smi mig -dgi

   nvidia-smi -mig 0
   ```

4. Label the node to use standard (non-MIG) GPU mode:

   ```bash
   kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
   ```
---
title: ClearML Fractional GPU Injector (CFGI)
---

The **ClearML Enterprise Fractional GPU Injector** (CFGI) allows AI workloads to utilize fractional (non-MIG) GPU slices on Kubernetes clusters, maximizing hardware efficiency and performance.

## Installation

### Add the Local ClearML Helm Repository

```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <GITHUB_TOKEN> --password <GITHUB_TOKEN>
helm repo update
```
### Requirements

* Install the NVIDIA `gpu-operator` using Helm
* Set the number of GPU slices to 8
* Add and update the NVIDIA Helm repo:

  ```bash
  helm repo add nvidia https://nvidia.github.io/gpu-operator
  helm repo update
  ```

* Credentials for the ClearML Enterprise DockerHub repository
### GPU Operator Configuration

#### For CFGI Version >= 1.3.0

1. Create a Docker Registry secret named `clearml-dockerhub-access` in the `gpu-operator` namespace. Make sure to replace `<CLEARML_DOCKERHUB_TOKEN>` with your token:

   ```bash
   kubectl create secret -n gpu-operator docker-registry clearml-dockerhub-access \
     --docker-server=docker.io \
     --docker-username=allegroaienterprise \
     --docker-password="<CLEARML_DOCKERHUB_TOKEN>" \
     --docker-email=""
   ```

1. Create a `gpu-operator.override.yaml` file as follows:
   * Set `devicePlugin.repository` to `docker.io/clearml`.
   * Configure `devicePlugin.config.data.renamed-resources.sharing.timeSlicing.resources` for each GPU index on the host.
   * Use the `nvidia.com/gpu-<INDEX>` format for the `rename` field, and set `replicas` to `8`.
   ```yaml
   gfd:
     imagePullSecrets:
       - "clearml-dockerhub-access"
   toolkit:
     env:
       - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
         value: "false"
       - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
         value: "true"
   devicePlugin:
     repository: docker.io/clearml
     image: k8s-device-plugin
     version: v0.17.1-gpu-card-selection
     imagePullPolicy: Always
     imagePullSecrets:
       - "clearml-dockerhub-access"
     env:
       - name: PASS_DEVICE_SPECS
         value: "true"
       - name: FAIL_ON_INIT_ERROR
         value: "true"
       - name: DEVICE_LIST_STRATEGY # Use volume-mounts
         value: volume-mounts
       - name: DEVICE_ID_STRATEGY
         value: uuid
       - name: NVIDIA_VISIBLE_DEVICES
         value: all
       - name: NVIDIA_DRIVER_CAPABILITIES
         value: all
     config:
       name: device-plugin-config
       create: true
       default: "renamed-resources"
       data:
         renamed-resources: |-
           version: v1
           flags:
             migStrategy: none
           sharing:
             timeSlicing:
               renameByDefault: false
               failRequestsGreaterThanOne: false
               # Edit the following configuration as needed, adding one entry per GPU index installed on the host.
               resources:
                 - name: nvidia.com/gpu
                   rename: nvidia.com/gpu-0
                   devices:
                     - "0"
                   replicas: 8
                 - name: nvidia.com/gpu
                   rename: nvidia.com/gpu-1
                   devices:
                     - "1"
                   replicas: 8
   ```
#### For CFGI Version < 1.3.0 (Legacy)

Create a `gpu-operator.override.yaml` file:

```yaml
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY # Use volume-mounts
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
  config:
    name: device-plugin-config
    create: true
    default: "any"
    data:
      any: |-
        version: v1
        flags:
          migStrategy: none
        sharing:
          timeSlicing:
            renameByDefault: false
            failRequestsGreaterThanOne: false
            resources:
              - name: nvidia.com/gpu
                replicas: 8
```
### Install GPU Operator and CFGI

1. Install the NVIDIA `gpu-operator` using the previously created `gpu-operator.override.yaml` file:

   ```bash
   helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
   ```

1. Create a `cfgi-values.override.yaml` file with the following content:

   ```yaml
   imageCredentials:
     password: "<CLEARML_DOCKERHUB_TOKEN>"
   ```

1. Install the CFGI Helm chart using the override file:

   ```bash
   helm install -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector --create-namespace -f cfgi-values.override.yaml
   ```
## Usage

To use fractional GPUs, label your pod with:

```yaml
labels:
  clearml-injector/fraction: "<GPU_FRACTION_VALUE>"
```

Valid values for `<GPU_FRACTION_VALUE>` include:
* Fractions:
  * "0.0625" (1/16th of a GPU)
  * "0.125" (1/8th)
  * "0.250"
  * "0.375"
  * "0.500"
  * "0.625"
  * "0.750"
  * "0.875"
* Whole GPUs, as integer representations such as `1.000`, `2`, `2.0`, etc.
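
In other words, the injector expects either a whole number of GPUs or a fraction on a 1/8th step (plus the single 1/16th value). A quick pre-flight check for a label value can be sketched as follows (the helper is illustrative, not part of CFGI):

```python
def is_valid_fraction(value: str) -> bool:
    """Check a clearml-injector/fraction label value: whole GPUs,
    0.0625 (1/16th), or eighths of a GPU strictly between 0 and 1."""
    f = float(value)
    if f <= 0:
        return False
    if f.is_integer():                      # 1, "2", "2.0", "1.000", ...
        return True
    return f == 0.0625 or (f < 1 and (f / 0.125).is_integer())

print([v for v in ("0.500", "0.3", "2.0", "0.0625") if is_valid_fraction(v)])
# ['0.500', '2.0', '0.0625']
```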

### ClearML Agent Configuration

To run ClearML jobs with fractional GPU allocation, configure your queues accordingly in your `clearml-agent-values.override.yaml` file.

Each queue should include a `templateOverrides` entry that sets the `clearml-injector/fraction` label, which determines the fraction of a GPU to allocate (e.g., "0.500" for half a GPU). This label is used by CFGI to assign the correct portion of GPU resources to the pod running the task.

#### CFGI Version >= 1.3.0

Starting from version 1.3.0, there is no need to specify the `resources` field. You only need to set the labels:
```yaml
agentk8sglue:
  createQueues: true
  queues:
    gpu-fraction-1_000:
      templateOverrides:
        labels:
          clearml-injector/fraction: "1.000"
    gpu-fraction-0_500:
      templateOverrides:
        labels:
          clearml-injector/fraction: "0.500"
    gpu-fraction-0_250:
      templateOverrides:
        labels:
          clearml-injector/fraction: "0.250"
    gpu-fraction-0_125:
      templateOverrides:
        labels:
          clearml-injector/fraction: "0.125"
```
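
The queue names above are just a naming convention: the fraction value with its decimal point replaced by an underscore. If you define many such queues, the name and label can be generated together (illustrative helper, not part of the chart):

```python
def fraction_queue(fraction: float) -> tuple[str, str]:
    """Return (queue name, label value) following the naming pattern
    used above, e.g. 0.5 -> ("gpu-fraction-0_500", "0.500")."""
    label = f"{fraction:.3f}"
    return f"gpu-fraction-{label.replace('.', '_')}", label

for f in (1.0, 0.5, 0.25, 0.125):
    name, label = fraction_queue(f)
    print(f"{name}: clearml-injector/fraction={label}")
```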

#### CFGI Version < 1.3.0

For versions older than 1.3.0, the GPU limits must also be defined:
```yaml
agentk8sglue:
  createQueues: true
  queues:
    gpu-fraction-1_000:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 8
    gpu-fraction-0_500:
      templateOverrides:
        labels:
          clearml-injector/fraction: "0.500"
        resources:
          limits:
            nvidia.com/gpu: 4
    gpu-fraction-0_250:
      templateOverrides:
        labels:
          clearml-injector/fraction: "0.250"
        resources:
          limits:
            nvidia.com/gpu: 2
    gpu-fraction-0_125:
      templateOverrides:
        labels:
          clearml-injector/fraction: "0.125"
        resources:
          limits:
            nvidia.com/gpu: 1
```
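
The limits above follow directly from the 8-way time slicing configured earlier: with 8 replicas advertised per physical card, a fraction maps to `fraction * 8` time-sliced replicas, so "0.500" needs 4, "0.250" needs 2, and so on. As a sanity check:

```python
REPLICAS_PER_GPU = 8  # matches the "replicas: 8" time-slicing configuration above

def gpu_limit(fraction: float) -> int:
    """Translate a GPU fraction into the legacy nvidia.com/gpu limit:
    one time-sliced replica per eighth of a GPU."""
    limit = fraction * REPLICAS_PER_GPU
    if not float(limit).is_integer():
        raise ValueError(f"{fraction} is not a multiple of 1/{REPLICAS_PER_GPU}")
    return int(limit)

print([gpu_limit(f) for f in (1.0, 0.5, 0.25, 0.125)])  # [8, 4, 2, 1]
```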

## Upgrading CFGI Chart

To upgrade to the latest chart version:

```bash
helm repo update
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector
```

To apply new values to an existing installation:

```bash
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector -f cfgi-values.override.yaml
```
## Disabling Fractions

To revert to standard GPU scheduling (without time slicing), remove the `devicePlugin.config` section from the `gpu-operator.override.yaml` file and upgrade the `gpu-operator`:

```yaml
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY # Use volume-mounts
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
```