Additional config options

revital 2025-05-13 13:33:57 +03:00
parent c01766f852
commit b21275c262
11 changed files with 1626 additions and 0 deletions


@ -0,0 +1,181 @@
🟡 Ready, but missing hyperlinks (see TODOs)
TODO:
- Link: GPU Operator
- Link: Additional configurations
- Link: Now proceed with AI App Gateway
---
title: ClearML Agent on Kubernetes
---
The ClearML Agent allows scheduling distributed experiments on a Kubernetes cluster.
## Prerequisites
- The ClearML Enterprise server is up and running.
- Create a set of `<ACCESS_KEY>` and `<SECRET_KEY>` credentials in the ClearML Server. The easiest way to do so is from
the ClearML UI (**Settings > Workspace > App Credentials > Create new credentials**).
:::note
Make sure that the generated keys belong to an admin user or a service user with admin privileges.
:::
- The worker environment should be able to communicate with the ClearML Server over the same network.
## Installation
### Add the Helm Repo Locally
Add the ClearML Helm repository:
```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```
Update the repository locally:
```bash
helm repo update
```
### Prepare Values
Create a `clearml-agent-values.override.yaml` file with the following content:
:::note
In the following configuration, replace the `<ACCESS_KEY>` and `<SECRET_KEY>` placeholders with the admin credentials
you have generated on the ClearML Server. The values for `<api|file|web>ServerUrlReference` should point to your ClearML
control-plane installation.
:::
```yaml
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
agentk8sglueKey: "<ACCESS_KEY>"
agentk8sglueSecret: "<SECRET_KEY>"
agentk8sglue:
apiServerUrlReference: "<CLEARML_API_SERVER_REFERENCE>"
fileServerUrlReference: "<CLEARML_FILE_SERVER_REFERENCE>"
webServerUrlReference: "<CLEARML_WEB_SERVER_REFERENCE>"
createQueues: true
queues:
exampleQueue:
templateOverrides: {}
queueSettings: {}
```
### Install the Chart
Install the ClearML Enterprise Agent Helm chart using the previous values override file:
```bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
```
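As an optional sanity check (not part of the chart documentation itself), you can verify that the Agent Pod has started correctly:
```bash
# List the Pods in the worker namespace and confirm the Agent Pod is Running
kubectl get pods -n <WORKER_NAMESPACE>
# Follow the Agent logs (replace <AGENT_POD_NAME> with the Pod name from the previous command)
kubectl logs -n <WORKER_NAMESPACE> <AGENT_POD_NAME> -f
```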
## Additional Configuration Options
:::note
You can view the full set of available and documented values of the chart by running one of the following commands:
```bash
helm show readme clearml-enterprise/clearml-enterprise-agent
# or
helm show values clearml-enterprise/clearml-enterprise-agent
```
:::
### Report GPUs in the Dashboard
The Agent must explicitly report the total number of GPUs available on the cluster in order for them to appear in the dashboard:
```yaml
agentk8sglue:
# -- Agent reporting to Dashboard max GPU available. This will report 2 GPUs.
dashboardReportMaxGpu: 2
```
### Queues
The ClearML Agent in Kubernetes monitors ClearML queues and pulls tasks that are scheduled for execution.
A single agent can monitor multiple queues. All queues share a base Pod template (`agentk8sglue.basePodTemplate`) that is
used when submitting a Task to Kubernetes after it has been pulled from a queue.
In addition, each queue can be configured with a dedicated Pod template spec override (`templateOverrides`). This way, queue definitions
can be mixed and matched to serve multiple use cases.
The following are a few examples of agent queue templates.
#### GPU Queues
:::note
The GPU queues configuration and usage from the ClearML Enterprise Agent requires deploying the Nvidia GPU Operator
on your Kubernetes cluster.
For more information, refer to the [GPU Operator](https://TODO) page.
:::
``` yaml
agentk8sglue:
createQueues: true
queues:
1xGPU:
templateOverrides:
resources:
limits:
nvidia.com/gpu: 1
2xGPU:
templateOverrides:
resources:
limits:
nvidia.com/gpu: 2
```
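Tasks can then be enqueued to either queue from the ClearML UI or programmatically. For illustration only, a minimal example using the `clearml-task` CLI (the project name and script path are placeholders):
```bash
# Enqueue a local script for remote execution on a single GPU
clearml-task --project <PROJECT_NAME> --name single-gpu-example --script train.py --queue 1xGPU
```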
#### Override a Pod Template by Queue
In the following example:
- The `red` queue will inherit both the label `team=red` and the 1Gi memory limit from the `basePodTemplate` section.
- The `blue` queue will set the label `team=blue`, but will inherit the 1Gi memory from the `basePodTemplate` section.
- The `green` queue will set both the label `team=green` and a 2Gi memory limit. It will also set an annotation and an environment variable.
```yaml
agentk8sglue:
# Defines common template
basePodTemplate:
labels:
team: red
resources:
limits:
memory: 1Gi
createQueues: true
queues:
red:
# Does not override
templateOverrides: {}
blue:
# Overrides labels
templateOverrides:
labels:
team: blue
green:
# Overrides labels and resources, plus set new fields
templateOverrides:
labels:
team: green
annotations:
example: "example value"
resources:
limits:
memory: 2Gi
env:
- name: MY_ENV
value: "my_value"
```
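Once Tasks have been pulled from these queues, you can optionally confirm that the per-queue overrides were applied by filtering the resulting Task Pods on their labels:
```bash
# Show Task Pods carrying the per-queue team labels
kubectl get pods -n <WORKER_NAMESPACE> -l team=blue
kubectl get pods -n <WORKER_NAMESPACE> -l team=green
```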
## Next Steps
Once the ClearML Enterprise Agent is up and running, proceed with deploying the ClearML Enterprise App Gateway.
TODO link to the AI App Gateway page in documentation


@ -0,0 +1,268 @@
🟢 Ready
TODO in future:
- Add NFS Example - https://allegroai.atlassian.net/wiki/x/AoCUiQ?atlOrigin=eyJpIjoiMjNiNTcxYTJiMzUxNDVhMThiODlhMTcwYzE1YWE3ZTUiLCJwIjoiYyJ9
---
# Dynamically Edit Task Pod Template
The ClearML Enterprise Agent supports defining custom Python code for interacting with a Task's Pod template before it gets applied to Kubernetes.
This makes it possible to dynamically edit a Task Pod manifest in the context of the ClearML Enterprise Agent, which can be useful in a variety of scenarios, such as customizing fields based on variables.
# Agent Configuration
The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` environment variable is used to indicate a Python module and function inside that module for the ClearML Enterprise Agent to run before applying a Task Pod template. The Agent will run this code from its own context, pass some arguments (including the actual template) to the function and use the returned template to create the final Task Pod in Kubernetes.
Arguments passed to the function include:
- `queue` - ID of the queue (string) from which the task was pulled.
- `queue_name` - Name of the queue (string) from which the task was pulled.
- `template` - Base template (Python dictionary) created from the agent's values, with any specific overrides for the queue from which the task was pulled.
- `task_data` - Task data structure (object) containing the task's information (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID.
- `providers_info` - Providers info (dictionary) containing optional information collected for the user running this task when the user logged into the system (requires additional server configuration).
- `task_config` - Task configuration (`clearml_agent.backend_config.Config` object) containing the configuration used to run this task. This includes any overrides added in Vaults applicable to the user running this task. Use `task_config.get("...")` to get specific configuration values.
- `worker` - The agent Python object, in case custom calls are required.
## Usage
Edit the `clearml-agent-values.override.yaml` file adding the following:
``` yaml
agentk8sglue:
extraEnvs:
- name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
value: "custom_code:update_template"
fileMounts:
- name: "custom_code.py"
folderPath: "/root"
fileContent: |-
import json
from pprint import pformat
def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
print(pformat(template))
my_var_name = "foo"
my_var_value = "bar"
try:
template["spec"]["containers"][0]["env"][0]["name"] = str(my_var_name)
template["spec"]["containers"][0]["env"][0]["value"] = str(my_var_value)
except KeyError as ex:
raise Exception("Failed modifying template: {}".format(ex))
return {"template": template}
```
## Notes
**Note**: Make sure to include `*args, **kwargs` at the end of the function's argument list and to only use keyword arguments. This is needed to maintain backward compatibility and to make sure that named arguments added or reordered in new agent versions won't affect your implementation.
**Note**: Custom code modules can be included as a file in the Pod's container, and the environment variable is used to simply point to the file and its entry point.
**Note**: When a custom code module is defined, by default the ClearML Enterprise Agent will start watching Pods in all namespaces across the cluster. If you do not intend to grant a ClusterRole permission, make sure to set the `CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES` env to `"0"` to prevent the ClearML Enterprise Agent from trying to list Pods in all namespaces. Set it to `"1"` instead if namespace-related changes are needed in the code.
``` yaml
agentk8sglue:
extraEnvs:
- name: CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES
value: "0"
```
**Note**: If instead you want to modify the Bash script used to start the Agent Pod or the Task Pods, use the following values:
``` yaml
agentk8sglue:
# -- Custom Bash script for the Agent pod ran by Glue Agent
customBashScript: ""
# -- Custom Bash script for the Task Pods ran by Glue Agent
containerCustomBashScript: ""
```
# Examples
## Example – Edit Template based on ENV var
``` yaml
agentk8sglue:
extraEnvs:
- name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
value: "custom_code:update_template"
fileMounts:
- name: "custom_code.py"
folderPath: "/root"
fileContent: |-
import json
from pprint import pformat
def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
print(pformat(template))
my_var = "some_var"
try:
template["spec"]["initContainers"][0]["command"][-1] = \
template["spec"]["initContainers"][0]["command"][-1].replace("MY_VAR", str(my_var))
template["spec"]["containers"][0]["volumeMounts"][0]["subPath"] = str(my_var)
except KeyError as ex:
raise Exception("Failed modifying template with MY_VAR: {}".format(ex))
return {"template": template}
basePodTemplate:
initContainers:
- name: myInitContainer
image: docker/ubuntu:18.04
command:
- /bin/bash
- -c
- >
echo MY_VAR;
volumeMounts:
- name: myTemplatedMount
mountPath: MY_VAR
volumes:
- name: myTemplatedMount
emptyDir: {}
```
## Example – NFS Mount Path
``` yaml
agentk8sglue:
extraEnvs:
- name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
value: "custom_code:update_template"
fileMounts:
- name: "custom_code.py"
folderPath: "/root"
fileContent: |-
import json
from pprint import pformat
def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
nfs = task_config.get("nfs")
# ad_role = providers_info.get("ad-role")
if nfs: # and ad_role == "some-value"
print(pformat(template))
try:
template["spec"]["containers"][0]["volumeMounts"].append(
{"name": "custom-mount", "mountPath": nfs.get("mountPath")}
)
template["spec"]["containers"][0]["volumes"].append(
{"name": "custom-mount", "nfs": {"server": nfs.get("server.ip"), "path": nfs.get("server.path")}}
)
except KeyError as ex:
raise Exception("Failed modifying template: {}".format(ex))
return {"template": template}
```
# Bind Additional Resources to Task Pod (PVC Example)
In this example, a dedicated PVC is dynamically created and attached to every Pod created from a dedicated queue, then deleted after the Pod deletion.
The following code block is commented to explain the context.
The key points are:
- `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` and `CLEARML_K8S_GLUE_POD_POST_DELETE_CMD` env vars let you define custom bash code hooks to be executed around the main apply command by the Agent, such as creating and deleting a PVC object.
- `CLEARML_K8S_GLUE_TEMPLATE_MODULE` env var and a file mount let you define custom Python code in a specific context, useful to dynamically update the main Pod template before the Agent applies it.
**Note**: This example uses a queue named `pvc-test`. If you use a different queue name, make sure to replace all occurrences of it.
**Note**: `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` can reference templated variables such as `{queue_name}`, `{pod_name}`, and `{namespace}`, which the Agent replaces with the actual values at execution time.
``` yaml
agentk8sglue:
# Bind a pre-defined custom 'custom-agent-role' Role with the ability to handle 'persistentvolumeclaims'
additionalRoleBindings:
- custom-agent-role
extraEnvs:
# Need this or permissions to list all namespaces
- name: CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES
value: "0"
# Executed before applying the Task Pod. Replace the $PVC_NAME placeholder in the manifest template with the Pod name and apply it, only in a specific queue.
- name: CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD
value: "[[ {queue_name} == 'pvc-test' ]] && sed 's/\\$PVC_NAME/{pod_name}/g' /mnt/yaml-manifests/pvc.yaml | kubectl apply -n {namespace} -f - || echo 'Skipping PRE_APPLY PVC creation...'"
# Executed after deleting the Task Pod. Delete the PVC.
- name: CLEARML_K8S_GLUE_POD_POST_DELETE_CMD
value: "kubectl delete pvc {pod_name} -n {namespace} || echo 'Skipping POST_DELETE PVC deletion...'"
# Define a custom code module for updating the Pod template
- name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
value: "custom_code:update_template"
fileMounts:
# Mount a PVC manifest file with a templated $PVC_NAME name
- name: "pvc.yaml"
folderPath: "/mnt/yaml-manifests"
fileContent: |
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: $PVC_NAME
spec:
resources:
requests:
storage: 5Gi
volumeMode: Filesystem
storageClassName: standard
accessModes:
- ReadWriteOnce
# Custom code module for updating the Pod template
- name: "custom_code.py"
folderPath: "/root"
fileContent: |-
import json
from pprint import pformat
def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
if queue_name == "pvc-test":
# Set PVC_NAME as the name of the Pod
PVC_NAME = f"clearml-id-{task_data.id}"
try:
# Replace the claimName placeholder with a dynamic value
template["spec"]["volumes"][0]["persistentVolumeClaim"]["claimName"] = str(PVC_NAME)
except KeyError as ex:
raise Exception("Failed modifying template with PVC_NAME: {}".format(ex))
# Return the edited template
return {"template": template}
createQueues: true
queues:
# Define a queue with an override `volumes` and `volumeMounts` section for binding a PVC
pvc-test:
templateOverrides:
volumes:
- name: task-pvc
persistentVolumeClaim:
# PVC_NAME placeholder. This will get replaced in the custom code module.
claimName: PVC_NAME
volumeMounts:
- mountPath: "/tmp/task/"
name: task-pvc
```
Example of `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`:
``` yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: custom-agent-role
rules:
- apiGroups:
- ""
resources:
- persistentvolumeclaims
verbs:
- get
- list
- watch
- create
- patch
- delete
```


@ -0,0 +1,61 @@
🟡 Ready, missing link
---
TODO:
- Link: fractional GPUs
---
# Basic Deployment - Suggested GPU Operator Values
## Add the Helm Repo Locally
Add the NVIDIA GPU Operator Helm repository:
``` bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
```
Update the repository locally:
``` bash
helm repo update
```
## Installation
As mentioned by NVIDIA, this configuration is needed to prevent unprivileged containers from bypassing the Kubernetes Device Plugin API.
Create a `gpu-operator.override.yaml` file with the following content:
``` yaml
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
```
Install the gpu-operator:
``` bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```
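As an optional check, confirm that the operator components are running and that the GPU resource is advertised on the nodes:
``` bash
# Verify the GPU Operator Pods are Running
kubectl get pods -n gpu-operator
# Verify a GPU node advertises the nvidia.com/gpu resource
kubectl describe node <NODE_NAME> | grep nvidia.com/gpu
```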
# Fractioning
For fractional GPU support, refer to the dedicated guides.
TODO link to the fractional_gpus directory page in documentation


@ -0,0 +1,24 @@
🟢 Ready
---
# Multi-Node Training
The ClearML Enterprise Agent supports horizontal multi-node training, running a single Task across multiple Pods on different nodes. Here is a configuration example (in `clearml-agent-values.override.yaml`):
``` yaml
agentk8sglue:
# Cluster access is required to run multi-node Tasks
serviceAccountClusterAccess: true
multiNode:
enabled: true
createQueues: true
queues:
multi-node-example:
queueSettings:
# Defines the distribution of GPUs Tasks across multiple nodes. The format [x, y, ...] specifies the distribution of Tasks as 'x' GPUs on a node and 'y' GPUs on another node. Multiple Pods will be spawned respectively based on the lowest-common-denominator defined.
multiNode: [ 4, 2 ]
templateOverrides:
resources:
limits:
# Note you will need to use the lowest-common-denominator of the GPUs distribution defined in `queueSettings.multiNode`.
nvidia.com/gpu: 2
```


@ -0,0 +1,211 @@
🟡 Ready, but missing hyperlinks (see TODOs)
---
TODO:
Control Plane:
- Link: basic k8s installation
- Link: SSO login
- Additional envs for control-plane multi-tenancy
Workers:
- Link: basic Agent installation
- Link: basic AI App Gateway installation
---
# Multi-Tenancy
## Control Plane
For installing the ClearML control-plane, follow this guide (TODO link to the basic_k8s_installation page).
Update the Server’s `clearml-values.override.yaml` with the following values:
``` yaml
apiserver:
extraEnvs:
- name: CLEARML__services__organization__features__user_management_advanced
value: "true"
- name: CLEARML__services__auth__ui_features_per_role__user__show_datasets
value: "false"
- name: CLEARML__services__auth__ui_features_per_role__user__show_orchestration
value: "false"
- name: CLEARML__services__workers__resource_usages__supervisor_company
value: "d1bd92a3b039400cbafc60a7a5b1e52b" # Default company
- name: CLEARML__secure__credentials__supervisor__role
value: "system"
- name: CLEARML__secure__credentials__supervisor__allow_login
value: "true"
- name: CLEARML__secure__credentials__supervisor__user_key
value: "<SUPERVISOR_USER_KEY>"
- name: CLEARML__secure__credentials__supervisor__user_secret
value: "<SUPERVISOR_USER_SECRET>"
- name: CLEARML__secure__credentials__supervisor__sec_groups
value: "[\"users\", \"admins\", \"queue_admins\"]"
- name: CLEARML__secure__credentials__supervisor__email
value: "\"<SUPERVISOR_USER_EMAIL>\""
- name: CLEARML__apiserver__company__unique_names
value: "true"
```
The credentials specified in `<SUPERVISOR_USER_KEY>` and `<SUPERVISOR_USER_SECRET>` can be used to log in as the supervisor user from the ClearML Web UI accessible using the URL `app.<BASE_DOMAIN>`.
Note that the `<SUPERVISOR_USER_EMAIL>` value must be explicitly quoted. To do so, put `\"` around the quoted value. Example: `"\"email@example.com\""`.
You will likely want to configure SSO as well. To do so, follow the "SSO (Identity Provider) Setup" guide (TODO link to the sso-login page).
### Create a Tenant
This section describes the steps required to create a new tenant on the ClearML control-plane server using a series of API calls.
Note that placeholders (`<PLACEHOLDER>`) in the following commands should be substituted with valid values based on your installation.
#### Create a new Tenant in the ClearML Control-plane
*Define variables to use in the next steps:*
``` bash
APISERVER_URL="https://api.<BASE_DOMAIN>"
APISERVER_KEY=<APISERVER_KEY>
APISERVER_SECRET=<APISERVER_SECRET>
```
**Note**: The apiserver key and secret should be the same as those used for installing the ClearML Enterprise server Chart.
*Create a Tenant (company):*
``` bash
curl $APISERVER_URL/system.create_company \
-H "Content-Type: application/json" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"name":"<TENANT_NAME>"}'
```
The result returns the new Company ID (`<COMPANY_ID>`).
If needed, list existing tenants (companies) using:
``` bash
curl -u $APISERVER_KEY:$APISERVER_SECRET $APISERVER_URL/system.get_companies
```
*Create an Admin User for the new tenant:*
``` bash
curl $APISERVER_URL/auth.create_user \
-H "Content-Type: application/json" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"name":"<ADMIN_USER_NAME>","company":"<COMPANY_ID>","email":"<ADMIN_USER_EMAIL>","role":"admin","internal":"true"}'
```
The result returns the new User ID (`<USER_ID>`).
*Create Credentials for the new Admin User:*
``` bash
curl $APISERVER_URL/auth.create_credentials \
-H "Content-Type: application/json" \
-H "X-Clearml-Impersonate-As: <USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET
```
The result returns a set of key and secret credentials associated with the new Admin User.
**Note**: You can use this set of credentials to set up an Agent or App Gateway for the newly created Tenant.
#### Create IDP/SSO sign-in rules
To map new users signing into the system to existing tenants, you can use one or more of the following routing methods to assign new users (based on their email address) to an existing tenant.
*Route an email to a tenant based on the email’s domain:*
This will instruct the server to assign any new user whose email domain matches the domain provided below to this specific tenant.
Note that providing the same domain name for multiple tenants will result in unstable behavior and should be avoided.
``` bash
curl $APISERVER_URL/login.set_domains \
-H "Content-Type: application/json" \
-H "X-Clearml-Act-As: <USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"domains":["<USERS_EMAIL_DOMAIN>"]}'
```
`<USERS_EMAIL_DOMAIN>` is the email domain set up for users to access through SSO.
*Route specific email(s) to a tenant:*
This will instruct the server to assign any new user whose email is found in this list to this specific tenant. You can use the `is_admin` property to choose whether these users will be set as admins in this tenant upon login.
Note that you can create more than one list per tenant (using multiple API calls) to create one list for admin users and another for non-admin users.
Note that including the same email address in more than a single tenant’s list will result in unstable behavior and should be avoided.
``` bash
curl $APISERVER_URL/login.add_whitelist_entries \
-H "Content-Type: application/json" \
-H "X-Clearml-Act-As: <USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"emails":["<email1>", "<email2>", ...],"is_admin":false}'
```
To remove existing email(s) from these lists, use the following API call. Note that this will not affect a user who has already logged in using one of these email addresses:
``` bash
curl $APISERVER_URL/login.remove_whitelist_entries \
-H "Content-Type: application/json" \
-H "X-Clearml-Act-As: <USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"emails":["<email1>", "<email2>", ...]}'
```
*Get the current login routing settings:*
To get the current IDP/SSO login rule settings for this tenant:
``` bash
curl $APISERVER_URL/login.get_settings \
-H "X-Clearml-Act-As: <USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET
```
### Limit Features for All Users in a Group
The server’s `clearml-values.override.yaml` can control some tenants’ configurations, limiting the features available to some users or groups in the system.
Example: With the following configuration, all users in the `users` group will only have the `applications` feature enabled.
``` yaml
apiserver:
extraEnvs:
- name: CLEARML__services__auth__default_groups__users__features
value: "[\"applications\"]"
```
Available Features:
- `applications` - Viewing and running applications
- `data_management` - Working with hyper-datasets and dataviews
- `experiments` - Viewing the experiment table and launching experiments
- `queues` - Viewing the queues screen
- `queue_management` - Creating and deleting queues
- `pipelines` - Viewing/managing pipelines in the system
- `reports` - Viewing and managing reports in the system
- `show_dashboard` - Show the dashboard screen
- `show_projects` - Show the projects menu option
- `resource_dashboard` - Display the resource dashboard in the orchestration page
## Workers
Refer to the following pages for installing and configuring the ClearML Enterprise Agent (TODO link to agent_k8s_installation) and App Gateway (TODO link to app-gateway).
**Note**: Make sure to set up the Agent and App Gateway using a tenant admin user's credentials, created with the tenant-creation APIs described above.
### Tenants Separation
In multi-tenant setups, you can separate the tenants' workers into different namespaces.
Create a Kubernetes Namespace for each tenant and install a dedicated ClearML Agent and AI Application Gateway in each Namespace.
A tenant’s Agent and Gateway need to be configured with credentials created on the ClearML server by a user of the same tenant.
Additional network separation can be achieved via Kubernetes Network Policies.
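For example, the following is a minimal NetworkPolicy sketch (names are placeholders; adapt it to your own network requirements) that only allows ingress traffic originating from Pods within the same tenant namespace:
``` yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: <TENANT_NAMESPACE>
spec:
  # Apply to all Pods in the tenant namespace
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    # Only accept traffic from Pods in the same namespace
    - from:
        - podSelector: {}
```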


@ -0,0 +1,59 @@
# ClearML Presign Service
The ClearML Presign Service is a secure component for generating and redirecting pre-signed storage URLs for authenticated users.
# Prerequisites
- The ClearML Enterprise server is up and running.
- Create a set of `<ACCESS_KEY>` and `<SECRET_KEY>` credentials in the ClearML Server. The easiest way to do so is from the ClearML UI (Settings → Workspace → App Credentials → Create new credentials).
Note: Make sure that the generated keys belong to an admin user or a service user with admin privileges.
- The worker environment should be able to communicate with the ClearML Server over the same network.
# Installation
## Add the Helm Repo Locally
Add the ClearML Helm repository:
``` bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```
Update the repository locally:
``` bash
helm repo update
```
## Prepare Values
Create a `presign-service.override.yaml` values override file, replacing the placeholders with your values:
``` yaml
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
apiServerUrlReference: "<CLEARML_API_SERVER_URL>"
apiKey: "<ACCESS_KEY>"
apiSecret: "<SECRET_KEY>"
ingress:
enabled: true
hostName: "<PRESIGN_SERVICE_URL>"
```
## Install
Install the `clearml-presign-service` Helm chart in the same namespace as the ClearML Enterprise server:
``` bash
helm install -n clearml clearml-presign-service clearml-enterprise/clearml-presign-service -f presign-service.override.yaml
```
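Optionally, verify that the Presign Service Pod is running in the server namespace:
``` bash
kubectl get pods -n clearml | grep presign
```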
## Configure ClearML Enterprise Server
After installing, edit the ClearML Enterprise server's `clearml-values.override.yaml` file, adding an extra environment variable to the `apiserver` component as follows (making sure to replace the `<PRESIGN_SERVICE_URL>` placeholder), then perform a Helm upgrade.
``` yaml
apiserver:
extraEnvs:
- name: CLEARML__SERVICES__SYSTEM__COMPANY__DEFAULT__SERVICES
value: "[{\"type\":\"presign\",\"url\":\"https://<PRESIGN_SERVICE_URL>\",\"use_fallback\":\"false\",\"match_sets\":[{\"rules\":[{\"field\":\"\",\"obj_type\":\"\",\"regex\":\"^s3://\"}]}]}]"
```
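For reference, the upgrade command would look similar to the following (the release name and chart reference are placeholders and depend on how the ClearML Enterprise server was originally installed):
``` bash
helm upgrade -n clearml <RELEASE_NAME> <CLEARML_ENTERPRISE_SERVER_CHART> -f clearml-values.override.yaml
```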


@ -0,0 +1,202 @@
# ClearML Tenant with Self Signed Certificates
This guide covers the configuration required to support custom SSL certificates for the following components:
- ClearML Enterprise AI Application Gateway
- ClearML Enterprise Agent
## AI Application Gateway
Add the following section in the `clearml-app-gateway-values.override.yaml` file:
``` yaml
# -- Custom certificates
customCertificates:
# -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
overrideCaCertificatesCrt:
# -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
extraCerts:
- alias: certificateName
pem: |
-----BEGIN CERTIFICATE-----
###
-----END CERTIFICATE-----
```
In this section, there are two options:
- Replace the whole `ca-certificates.crt` file
- Add extra certificates to the existing `ca-certificates.crt`
Let’s see them in detail.
### Replace the whole `ca-certificates.crt` file
If you need to replace the whole CA bundle, provide a concatenation of all your valid CA certificates in PEM format, just as they are stored in a standard `ca-certificates.crt`.
``` yaml
# -- Custom certificates
customCertificates:
# -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
overrideCaCertificatesCrt: |
-----BEGIN CERTIFICATE-----
### CERT 1
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
### CERT 2
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
### CERT 3
-----END CERTIFICATE-----
...
```
### Add extra certificates to the existing `ca-certificates.crt`
To add extra certificates to the standard CA bundle available in the container, define each certificate as an entry in the list.
Make sure to provide a different alias for each entry.
``` yaml
# -- Custom certificates
customCertificates:
# -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
extraCerts:
- alias: certificate-name-1
pem: |
-----BEGIN CERTIFICATE-----
###
-----END CERTIFICATE-----
- alias: certificate-name-2
pem: |
-----BEGIN CERTIFICATE-----
###
-----END CERTIFICATE-----
```
### Apply changes
After updating the values file, make sure to run the upgrade command:
``` bash
helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
```
## ClearML Enterprise Agent
Add the following section in the `clearml-agent-values.override.yaml` file:
``` yaml
# -- Custom certificates
customCertificates:
# -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
overrideCaCertificatesCrt:
# -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
extraCerts:
- alias: certificateName
pem: |
-----BEGIN CERTIFICATE-----
###
-----END CERTIFICATE-----
```
In this section, there are two options:
- Replace the whole `ca-certificates.crt` file
- Add extra certificates to the existing `ca-certificates.crt`
Let’s see them in detail.
### Replace the whole `ca-certificates.crt` file
If you need to replace the whole CA bundle, provide a concatenation of all your valid CA certificates in PEM format, just as they are stored in a standard `ca-certificates.crt`.
``` yaml
# -- Custom certificates
customCertificates:
# -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
overrideCaCertificatesCrt: |
-----BEGIN CERTIFICATE-----
### CERT 1
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
### CERT 2
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
### CERT 3
-----END CERTIFICATE-----
...
```
### Add extra certificates to the existing `ca-certificates.crt`
To add extra certificates to the standard CA bundle available in the container, define each certificate as an entry in the list.
Make sure to provide a different alias for each entry.
``` yaml
# -- Custom certificates
customCertificates:
# -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
extraCerts:
- alias: certificate-name-1
pem: |
-----BEGIN CERTIFICATE-----
###
-----END CERTIFICATE-----
- alias: certificate-name-2
pem: |
-----BEGIN CERTIFICATE-----
###
-----END CERTIFICATE-----
```
### Add certificates to Tasks
``` yaml
agentk8sglue:
basePodTemplate:
initContainers:
- command:
- /bin/sh
- -c
- update-ca-certificates
image: allegroai/clearml-enterprise-agent-k8s-base:<AGENT-VERSION-AVAIABLE-ON-REPO>
imagePullPolicy: IfNotPresent
name: init-task
volumeMounts:
- name: etc-ssl-certs
mountPath: "/etc/ssl/certs"
- name: clearml-extra-ca-certs
mountPath: "/usr/local/share/ca-certificates"
env:
- name: REQUESTS_CA_BUNDLE
value: /etc/ssl/certs/ca-certificates.crt
volumeMounts:
- name: etc-ssl-certs
mountPath: "/etc/ssl/certs"
volumes:
- name: etc-ssl-certs
emptyDir: {}
- name: clearml-extra-ca-certs
projected:
defaultMode: 420
sources:
# LIST HERE CONFIGMAPS CREATED BY THE AGENT CHART, THE CARDINALITY DEPENDS ON THE NUMBER OF CERTS PROVIDED.
- configMap:
name: clearml-agent-clearml-enterprise-agent-custom-ca-0
- configMap:
name: clearml-agent-clearml-enterprise-agent-custom-ca-1
```
Please note the `clearml-extra-ca-certs` volume: make sure to add each ConfigMap created by the Agent chart.
The names can differ based on the release name used during installation.
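To find the exact ConfigMap names created in your installation (an optional check, assuming the `custom-ca` naming shown in the example above):
``` bash
kubectl get configmaps -n <WORKER_NAMESPACE> | grep custom-ca
```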
### Apply changes
After updating the values file, make sure to run the upgrade command:
``` bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
```


@ -0,0 +1,60 @@
# SSO (Identity Provider) Setup
ClearML Enterprise Server supports various SSO identity providers. The configuration values can be set in `clearml-values.override.yaml`.
The following are a few examples. Supported providers include Auth0, Keycloak, Okta, Azure, Google, and Cognito.
## Auth0
``` yaml
apiserver:
extraEnvs:
- name: CLEARML__secure__login__sso__oauth_client__auth0__client_id
value: "<AUTH0_CLIENT_ID>"
- name: CLEARML__secure__login__sso__oauth_client__auth0__client_secret
value: "<AUTH0_CLIENT_SECRET>"
- name: CLEARML__services__login__sso__oauth_client__auth0__base_url
value: "<AUTH0_BASE_URL>"
- name: CLEARML__services__login__sso__oauth_client__auth0__authorize_url
value: "<AUTH0_AUTHORIZE_URL>"
- name: CLEARML__services__login__sso__oauth_client__auth0__access_token_url
value: "<AUTH0_ACCESS_TOKEN_URL>"
- name: CLEARML__services__login__sso__oauth_client__auth0__audience
value: "<AUTH0_AUDIENCE>"
```
## Keycloak
``` yaml
apiserver:
extraEnvs:
- name: CLEARML__secure__login__sso__oauth_client__keycloak__client_id
value: "<KC_CLIENT_ID>"
- name: CLEARML__secure__login__sso__oauth_client__keycloak__client_secret
value: "<KC_SECRET_ID>"
- name: CLEARML__services__login__sso__oauth_client__keycloak__base_url
value: "<KC_URL>/realms/<REALM_NAME>/"
- name: CLEARML__services__login__sso__oauth_client__keycloak__authorize_url
value: "<KC_URL>/realms/<REALM_NAME>/protocol/openid-connect/auth"
- name: CLEARML__services__login__sso__oauth_client__keycloak__access_token_url
value: "<KC_URL>/realms/<REALM_NAME>/protocol/openid-connect/token"
- name: CLEARML__services__login__sso__oauth_client__keycloak__idp_logout
value: "true"
```
### Note if using Groups Mapping
When configuring the OpenID client for ClearML:
- Navigate to the Client Scopes tab.
- Click on the first row, `<clearml client>-dedicated`.
- Click "Add Mapper" → "By configuration" and then select the "Group membership" option.
- In the opened dialog, enter the name "groups" and set the Token claim name to "groups".
- Uncheck the "Full group path" option and save the mapper.
To validate the configuration:
- Return to the Client Details → Client scopes tab.
- Go to the Evaluate sub-tab and select a user who has any group memberships.
- On the right side, navigate to the Generated ID token and then to the Generated User Info.
- Verify that in both cases you can see the `groups` claim in the displayed user data.


@ -0,0 +1,131 @@
🟢 Ready
---
# ClearML Dynamic MIG Operator (CDMO)
The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations on NVIDIA GPUs.
# Installation
## Requirements
Install the official NVIDIA `gpu-operator` using Helm with the following configuration.
Add and update the Nvidia Helm repo:
``` bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```
Create a `gpu-operator.override.yaml` file with the following content:
``` yaml
migManager:
enabled: false
mig:
strategy: mixed
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
```
Install the official NVIDIA `gpu-operator` using Helm with the previous configuration:
``` bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```
## Install
Create a `cdmo-values.override.yaml` file with the following content:
``` yaml
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
```
Install the CDMO operator Helm Chart using the previous override file:
``` bash
helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
```
Enable NVIDIA MIG support on your cluster by running the following command on all nodes with a MIG-capable GPU (run it for each GPU `<GPU_ID>` on the host):
``` bash
nvidia-smi -mig 1
```
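To target a single GPU index instead of all GPUs on the host, `nvidia-smi` also accepts the `-i` flag, for example:
``` bash
nvidia-smi -i <GPU_ID> -mig 1
```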
**NOTE**: The node might need to be rebooted if the output of the previous command indicates so.
**NOTE**: For convenience, this command can be issued from inside the `nvidia-device-plugin-daemonset` Pod running on the related node.
Each MIG-enabled GPU node `<NODE_NAME>` from the previous step must then be labeled accordingly:
``` bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
```
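You can verify the label was applied with:
``` bash
kubectl get nodes -L cdmo.clear.ml/gpu-partitioning
```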
# Unconfigure MIGs
To disable MIG, follow these steps in order:
1. Ensure there are no running workloads requesting any form of GPU on the node(s) before reconfiguring them.
2. Remove the CDMO label from the target Node(s) to disable the dynamic MIG reconfiguration.
``` bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
```
3. Execute a shell into the `device-plugin-daemonset` Pod instance running on the target Node(s) and execute the following commands in order:
``` bash
nvidia-smi mig -dci
nvidia-smi mig -dgi
nvidia-smi -mig 0
```
4. Edit the `gpu-operator.override.yaml` file to have a standard configuration for full GPUs and upgrade the `gpu-operator`:
``` yaml
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
```


@ -0,0 +1,129 @@
---
title: Install CDMO and CFGI on the same Cluster
---
In clusters where different nodes have different GPU types, the `gpu-operator` can be used to manage different devices and
GPU-fractioning modes.
## Configuring the NVIDIA GPU Operator
The NVIDIA `gpu-operator` allows defining multiple configurations for the Device Plugin.
The following YAML values define three usable configs: `all-disabled` (vanilla full GPUs), `ts` (time-slicing), and `mig`.
``` yaml
migManager:
enabled: false
mig:
strategy: mixed
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
enabled: true
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
config:
name: device-plugin-config
create: true
default: "all-disabled"
data:
all-disabled: |-
version: v1
flags:
migStrategy: none
ts: |-
version: v1
flags:
migStrategy: none
sharing:
timeSlicing:
renameByDefault: false
failRequestsGreaterThanOne: false
# Edit the following configuration as needed, adding as many GPU indices as many cards are installed on the Host.
resources:
- name: nvidia.com/gpu
rename: nvidia.com/gpu-0
devices:
- "0"
replicas: 8
mig: |-
version: v1
flags:
migStrategy: mixed
```
## Usage
The previously defined configurations are applied to Kubernetes nodes using labels. After a label is added to a node,
the NVIDIA `device-plugin` will automatically reconfigure the node accordingly.
The following examples use the configuration defined above.
* Apply the MIG `mig` config:
``` bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=mig
```
* Apply the time slicing `ts` config:
``` bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts
```
* Apply the vanilla full GPUs `all-disabled` config:
``` bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled
```
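After labeling, you can optionally verify which config each node is using and that the device plugin has reconfigured the advertised GPU resources:
``` bash
# Show the device-plugin config label applied to each node
kubectl get nodes -L nvidia.com/device-plugin.config
# Inspect the GPU resources advertised by a node after reconfiguration
kubectl describe node <NODE_NAME> | grep nvidia.com/
```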
## Install CDMO and CFGI
After configuring the NVIDIA `gpu-operator` and labeling nodes, proceed with the standard [CDMO](cdmo.md) and [CFGI](cfgi.md)
installation.
## Disabling
### Time Slicing
To toggle between time slicing and vanilla full GPUs, simply toggle the label value between `ts` and `all-disabled` using
the `--overwrite` flag in kubectl.
### MIG
To disable MIG, follow these steps:
1. Ensure there are no more running workflows requesting any form of GPU on the Node(s) before re-configuring it.
2. Remove the CDMO label from the target Node(s) to disable the dynamic MIG reconfiguration.
``` bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
```
3. Execute a shell into the `device-plugin-daemonset` Pod instance running on the target Node(s) and execute the following commands:
``` bash
nvidia-smi mig -dci
nvidia-smi mig -dgi
nvidia-smi -mig 0
```
4. Label the target node to disable MIG.
``` bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
```


@ -0,0 +1,300 @@
---
title: ClearML Fractional GPU Injector (CFGI)
---
The **ClearML Enterprise Fractional GPU Injector** (CFGI) allows AI workloads to run on Kubernetes using non-MIG GPU
fractions, optimizing both hardware utilization and performance.
## Installation
### Add the Local ClearML Helm Repository
``` bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <GITHUB_TOKEN> --password <GITHUB_TOKEN>
helm repo update
```
### Requirements
* Install the official NVIDIA `gpu-operator` using Helm with one of the following configurations.
* The number of slices must be 8.
* Add and update the Nvidia Helm repo:
``` bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```
#### GPU Operator Configuration
##### For CFGI Version >= 1.3.0
1. Create a docker-registry secret named `clearml-dockerhub-access` in the `gpu-operator` Namespace, making sure to replace your `<CLEARML_DOCKERHUB_TOKEN>`:
```bash
kubectl create secret -n gpu-operator docker-registry clearml-dockerhub-access \
--docker-server=docker.io \
--docker-username=allegroaienterprise \
--docker-password="<CLEARML_DOCKERHUB_TOKEN>" \
--docker-email=""
```
1. Create a `gpu-operator.override.yaml` file as follows:
* Set `devicePlugin.repository` to `docker.io/clearml`
* Configure `devicePlugin.config.data.renamed-resources.sharing.timeSlicing.resources` for each GPU index on the host
* Use `nvidia.com/gpu-<INDEX>` format for the `rename` field, and set `replicas` to `8`.
```yaml
gfd:
imagePullSecrets:
- "clearml-dockerhub-access"
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
repository: docker.io/clearml
image: k8s-device-plugin
version: v0.17.1-gpu-card-selection
imagePullPolicy: Always
imagePullSecrets:
- "clearml-dockerhub-access"
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
config:
name: device-plugin-config
create: true
default: "renamed-resources"
data:
renamed-resources: |-
version: v1
flags:
migStrategy: none
sharing:
timeSlicing:
renameByDefault: false
failRequestsGreaterThanOne: false
# Edit the following configuration as needed, adding as many GPU indices as many cards are installed on the Host.
resources:
- name: nvidia.com/gpu
rename: nvidia.com/gpu-0
devices:
- "0"
replicas: 8
- name: nvidia.com/gpu
rename: nvidia.com/gpu-1
devices:
- "1"
replicas: 8
```
##### For CFGI version < 1.3.0 (Legacy GPU Operator)
Create a `gpu-operator.override.yaml` file:
``` yaml
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
config:
name: device-plugin-config
create: true
default: "any"
data:
any: |-
version: v1
flags:
migStrategy: none
sharing:
timeSlicing:
renameByDefault: false
failRequestsGreaterThanOne: false
resources:
- name: nvidia.com/gpu
replicas: 8
```
### Install
Install the nvidia `gpu-operator` using the previously created `gpu-operator.override.yaml` override file:
```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```
Create a `cfgi-values.override.yaml` file with the following content:
```yaml
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
```
Install the CFGI Helm Chart using the previous override file:
```bash
helm install -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector --create-namespace -f cfgi-values.override.yaml
```
## Usage
The Fractional GPU Injector will inject its CUDA files into Pods that have the following label:
```yaml
labels:
clearml-injector/fraction: "<GPU_FRACTION_VALUE>"
```
where `"<GPU_FRACTION_VALUE>"` must be equal one of following values:
* "0.0625" (1/16th)
* "0.125" (1/8th)
* "0.250"
* "0.375"
* "0.500"
* "0.625"
* "0.750"
* "0.875"
In addition, whole-GPU values such as `1.000`, `2`, or `2.0` are also valid.
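When working through the ClearML Agent, this label is set per queue via `templateOverrides` (see the next section). For a Pod created outside of ClearML, the label would appear in the manifest metadata; the following is a minimal illustrative sketch (image and names are placeholders, shown for CFGI >= 1.3.0 where no explicit `resources` request is needed):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fraction-example
  labels:
    # Request half a GPU via the fractional GPU injector
    clearml-injector/fraction: "0.500"
spec:
  containers:
    - name: workload
      image: <CUDA_IMAGE>
```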
### ClearML Enterprise Agent Configuration
To specify resource allocation when using the CFGI, set the following values in `clearml-agent-values.override.yaml`.
The label `clearml-injector/fraction: "<GPU_FRACTION_VALUE>"` is required in order to specify the fraction of a GPU to be
assigned to the Pod started for a Task pulled from the respective queue.
#### CFGI Version >= 1.3.0
Starting from version 1.3.0, there is no need to specify the `resources` field.
``` yaml
agentk8sglue:
createQueues: true
queues:
gpu-fraction-1_000:
templateOverrides:
labels:
clearml-injector/fraction: "1.000"
gpu-fraction-0_500:
templateOverrides:
labels:
clearml-injector/fraction: "0.500"
gpu-fraction-0_250:
templateOverrides:
labels:
clearml-injector/fraction: "0.250"
gpu-fraction-0_125:
templateOverrides:
labels:
clearml-injector/fraction: "0.125"
```
#### CFGI Version < 1.3.0
```yaml
agentk8sglue:
createQueues: true
queues:
gpu-fraction-1_000:
templateOverrides:
resources:
limits:
nvidia.com/gpu: 8
gpu-fraction-0_500:
templateOverrides:
labels:
clearml-injector/fraction: "0.500"
resources:
limits:
nvidia.com/gpu: 4
gpu-fraction-0_250:
templateOverrides:
labels:
clearml-injector/fraction: "0.250"
resources:
limits:
nvidia.com/gpu: 2
gpu-fraction-0_125:
templateOverrides:
labels:
clearml-injector/fraction: "0.125"
resources:
limits:
nvidia.com/gpu: 1
```
## Upgrading Chart
### Upgrades / Values Upgrades
To update to the latest version of this chart:
```bash
helm repo update
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector
```
To change values on an existing installation:
```bash
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector -f cfgi-values.override.yaml
```
## Disable Fractions
To toggle between timeSlicing and vanilla full GPUs, remove the `devicePlugin.config` section from the `gpu-operator.override.yaml`
file and upgrade the `gpu-operator`:
```yaml
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
```