This commit is contained in:
revital 2025-05-15 14:55:35 +03:00
parent d94d777e55
commit aaa3851de3
10 changed files with 139 additions and 147 deletions

View File

@ -6,8 +6,8 @@ The ClearML Agent enables scheduling and executing distributed experiments on a
## Prerequisites
- A running [ClearML Enterprise Server](k8s.md)
- API credentials (`<ACCESS_KEY>` and `<SECRET_KEY>`) generated via the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
:::note
@ -15,7 +15,7 @@ The ClearML Agent enables scheduling and executing distributed experiments on a
:::
- The worker environment must be able to access the ClearML Server over the same network.
- Helm token to access the `clearml-enterprise` Helm chart repo

## Installation
@ -36,9 +36,9 @@ helm repo update
Create a `clearml-agent-values.override.yaml` file with the following content:

:::note
Replace `<ACCESS_KEY>` and `<SECRET_KEY>` with the API credentials you generated earlier. Set the `<api|file|web>ServerUrlReference` fields to match your ClearML Server URLs.
:::

```yaml
@ -60,7 +60,7 @@ agentk8sglue:
### Install the Chart

Install the ClearML Enterprise Agent Helm chart:

```bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
@ -68,7 +68,7 @@ helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-e
## Additional Configuration Options

To view available configuration options for the Helm chart, run the following command:

```bash
helm show readme clearml-enterprise/clearml-enterprise-agent
@ -76,7 +76,7 @@ helm show readme clearml-enterprise/clearml-enterprise-agent
helm show values clearml-enterprise/clearml-enterprise-agent
```

### Reporting GPU Availability to Orchestration Dashboard

To show GPU availability in the [Orchestration Dashboard](../../webapp/webapp_orchestration_dash.md), explicitly set the number of GPUs:
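For example, a values override along these lines (a minimal sketch; the `dashboardReportedGpus` key name is an assumption here, so confirm the exact field with `helm show values` for your chart version):

```yaml
agentk8sglue:
  # Number of GPUs this worker reports to the Orchestration Dashboard.
  # NOTE: key name assumed for illustration; verify against the chart's values.
  dashboardReportedGpus: 2
```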
@ -88,25 +88,22 @@ agentk8sglue:
### Queues

The ClearML Agent monitors [ClearML queues](../../fundamentals/agents_and_queues.md) and pulls tasks that are scheduled for execution.

A single agent can monitor multiple queues. By default, all queues share a base pod template (`agentk8sglue.basePodTemplate`) used when launching tasks on Kubernetes after they have been pulled from the queue.
Each queue can be configured to override the base pod template with its own settings using a `templateOverrides` queue template. This way, queue definitions can be tailored to different use cases.
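To illustrate how the base template and a per-queue override relate, here is a minimal sketch (label and resource values are placeholders, not chart defaults):

```yaml
agentk8sglue:
  # Shared defaults applied to every task pod
  basePodTemplate:
    labels:
      team: red
    resources:
      limits:
        memory: 1Gi
  createQueues: true
  queues:
    red:
      # Per-queue settings merged on top of the base template
      templateOverrides:
        labels:
          priority: high
```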
The following are a few examples of agent queue templates:

#### Example: GPU Queues

To support GPU queues, you must deploy the NVIDIA GPU Operator on your Kubernetes cluster. For more information, see [GPU Operator](extra_configs/gpu_operator.md).

```yaml
agentk8sglue:
  createQueues: true
  queues:
@ -122,8 +119,9 @@ agentk8sglue:
nvidia.com/gpu: 2
```

#### Example: Custom Pod Template per Queue
This example demonstrates how to override the base pod template definitions on a per-queue basis.
In this example:
- The `red` queue inherits both the label `team=red` and the 1Gi memory limit from the `basePodTemplate` section.
@ -167,5 +165,5 @@ agentk8sglue:
## Next Steps

Once the agent is up and running, proceed with deploying the [ClearML Enterprise App Gateway](appgw_install_k8s.md).

View File

@ -2,18 +2,15 @@
title: Dynamically Edit Task Pod Template
---
ClearML Agent allows you to inject custom Python code to dynamically modify the Kubernetes Pod template before applying it.
## Agent Configuration
The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` environment variable defines the Python module and function inside that module to be invoked by the agent before applying a task pod template. The agent will run this code in its own context, pass arguments (including the actual template) to the function, and use the returned template to create the final Task Pod in Kubernetes.

Arguments passed to the function include:
@ -60,13 +57,13 @@ agentk8sglue:
```
:::note notes
* Always include `*args, **kwargs` at the end of the function's argument list and only use keyword arguments. This is needed to maintain backward compatibility.
* Custom code modules can be included as a file in the pod's container, and the environment variable can be used to point to the file and entry point.
* When a custom code module is defined, the agent by default watches pods in all namespaces across the cluster. If you do not intend to grant a `ClusterRole` permission, set the `CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES` env var to `"0"` to prevent the agent from trying to list pods in all namespaces. Set it to `"1"` if namespace-related changes are needed in the code.
@ -80,13 +77,13 @@ agentk8sglue:
To customize the bash startup scripts instead of the pod spec, use:

```yaml
agentk8sglue:
  # -- Custom Bash script for the Agent pod ran by Glue Agent
  customBashScript: ""
  # -- Custom Bash script for the Task Pods ran by Glue Agent
  containerCustomBashScript: ""
```
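For instance, to run an extra setup step as part of every Task Pod's startup script, the task-side value could be set like this (an illustrative sketch; the installed package is an arbitrary placeholder, and the multi-line string assumes the chart passes the value through verbatim):

```yaml
agentk8sglue:
  # Illustrative only: install an extra OS package in each Task Pod container
  containerCustomBashScript: |
    apt-get update
    apt-get install -y --no-install-recommends jq
```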
## Examples
@ -167,7 +164,7 @@ agentk8sglue:
### Example: Bind PVC Resource to Task Pod

In this example, a PVC is created and attached to every pod created from a dedicated queue, and is deleted afterwards.

Key points:

View File

@ -2,10 +2,12 @@
title: Basic Deployment - Suggested GPU Operator Values
---
This guide provides recommended configuration values for deploying the NVIDIA GPU Operator alongside ClearML Enterprise.
## Add the Helm Repo Locally

Add the NVIDIA GPU Operator Helm repository:

```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
```
@ -17,10 +19,8 @@ helm repo update
## Installation

To prevent unprivileged containers from bypassing the Kubernetes Device Plugin API, configure the GPU operator using the following `gpu-operator.override.yaml` file:
```yaml
toolkit:
@ -53,7 +53,7 @@ helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace
## Fractional GPU Support

To enable fractional GPU allocation or manage mixed GPU configurations, refer to the following guides:
* [ClearML Dynamic MIG Operator](../fractional_gpus/cdmo.md) (CDMO): Dynamically configures MIG GPUs on supported devices.
* [ClearML Enterprise Fractional GPU Injector](../fractional_gpus/cfgi.md) (CFGI): Enables fractional (non-MIG) GPU allocation for better hardware utilization and workload distribution in Kubernetes.

View File

@ -2,7 +2,8 @@
title: Multi-Node Training
---

The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single Task to run across multiple pods on different nodes.

Below is a configuration example using `clearml-agent-values.override.yaml`:

View File

@ -7,16 +7,16 @@ users, enabling direct access to cloud-hosted data (e.g., S3) without exposing c
## Prerequisites

- A ClearML Enterprise Server is up and running.
- API credentials (`<ACCESS_KEY>` and `<SECRET_KEY>`) generated via the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
:::note
Make sure these credentials belong to an admin user or a service user with admin privileges.
:::
- The worker environment must be able to access the ClearML Server over the same network.
- Token to access `clearml-enterprise` Helm chart repo
## Installation
@ -50,7 +50,7 @@ ingress:
### Deploy the Helm Chart

Install the `clearml-presign-service` Helm chart in the same namespace as the ClearML Enterprise server:

```bash
helm install -n clearml clearml-presign-service clearml-enterprise/clearml-presign-service -f presign-service.override.yaml

View File

@ -2,10 +2,8 @@
title: ClearML Tenant with Self Signed Certificates
---

This guide covers how to configure the [AI Application Gateway](#ai-application-gateway) and [ClearML Agent](#clearml-agent) to use self-signed or custom SSL certificates.
## AI Application Gateway
@ -25,15 +23,15 @@ customCertificates:
-----END CERTIFICATE-----
```
You have two configuration options:

- [**Replace**](#replace-entire-ca-certificatescrt-file) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt) extra certificates to the existing `ca-certificates.crt`

### Replace Entire `ca-certificates.crt` File

To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format as they are stored in a standard `ca-certificates.crt`.
```yaml
@ -55,7 +53,7 @@ customCertificates:
### Append Extra Certificates to the Existing `ca-certificates.crt`

You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.

```yaml
# -- Custom certificates
@ -82,9 +80,9 @@ To apply the changes, run the update command:
helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
```
## ClearML Agent

For the ClearML Agent, configure certificates in the `clearml-agent-values.override.yaml` file:

```yaml
# -- Custom certificates
@ -100,17 +98,18 @@ customCertificates:
-----END CERTIFICATE-----
```

You have two configuration options:

- [**Replace**](#replace-entire-ca-certificatescrt-file-1) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt-1) extra certificates to the existing `ca-certificates.crt`

### Replace Entire `ca-certificates.crt` File

To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format as they are stored in a standard `ca-certificates.crt`.

```yaml
# -- Custom certificates
customCertificates:
@ -130,7 +129,7 @@ customCertificates:
### Append Extra Certificates to the Existing `ca-certificates.crt`

You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.

```yaml
# -- Custom certificates
@ -151,7 +150,7 @@ customCertificates:
### Add Certificates to Task Pods

If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into pods:

```yaml
agentk8sglue:
@ -195,7 +194,7 @@ Their names are usually prefixed with the Helm release name, so adjust according
### Apply Changes

Apply the changes by running the update command:

```bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml

View File

@ -3,7 +3,8 @@ title: SSO (Identity Provider) Setup
---

ClearML Enterprise Server supports various Single Sign-On (SSO) identity providers. SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and is applied to the `apiserver` component.
The following are configuration examples for commonly used providers. Other supported systems include:
* Auth0
@ -11,7 +12,7 @@ The following are configuration examples for commonly used providers. Other supp
* Okta
* Azure AD
* Google
* AWS Cognito
## Auth0
@ -56,17 +57,17 @@ apiserver:
To map Keycloak groups into the ClearML user's SSO token:

1. Go to the **Client Scopes** tab.
1. Click on the `<clearml client>-dedicated` scope.
1. Click **Add Mapper > By Configuration > Group Membership**.
1. Configure the mapper:
   * Set the **Name** to "groups"
   * Set the **Token Claim Name** to "groups"
   * Uncheck **Full group path**
   * Save the mapper.

To verify:

1. Go to the **Client Details > Client scope** tab.
1. Go to the **Evaluate** sub-tab and select a user with any group memberships.
1. Go to **Generated ID Token** and then to **Generated User Info**.
1. Check that the `groups` claim appears in the displayed user data in both cases.

View File

@ -2,14 +2,12 @@
title: ClearML Dynamic MIG Operator (CDMO)
---

The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations.

## Installation

### Requirements

* Add and update the Nvidia Helm repo:
```bash
@ -46,7 +44,7 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.
value: all
```

* Install the NVIDIA `gpu-operator` using Helm with the previous configuration:
```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
@ -54,33 +52,33 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.
### Installing CDMO

1. Create a `cdmo-values.override.yaml` file with the following content:

```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
```

1. Install the CDMO Helm Chart using the previous override file:

```bash
helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
```

1. Enable NVIDIA MIG support on your cluster by running the following command on all nodes with a MIG-supported GPU (run it for each GPU `<GPU_ID>` on the host):

```bash
nvidia-smi -mig 1
```

:::note notes
* A node reboot may be required if the command output indicates so.
* For convenience, this command can be run from within the `nvidia-device-plugin-daemonset` pod running on the related node.
:::

1. Label each MIG-enabled GPU node `<NODE_NAME>` from the previous step:

```bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
@ -88,7 +86,7 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.
## Disabling MIGs

To disable MIG mode and restore standard full-GPU access:

1. Ensure no running workflows are using GPUs on the target node(s).
@ -108,7 +106,7 @@ To disable MIG, follow these steps:
nvidia-smi -mig 0
```

4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access, and upgrade the `gpu-operator`:

```yaml
toolkit:

View File

@ -1,7 +1,8 @@
---
title: Install CDMO and CFGI on the Same Cluster
---

You can install both CDMO (ClearML Dynamic MIG Operator) and CFGI (ClearML Fractional GPU Injector) on a shared Kubernetes cluster.

In clusters with multiple nodes and varying GPU types, the `gpu-operator` can be used to manage different device configurations and fractioning modes.
@ -11,7 +12,7 @@ The NVIDIA `gpu-operator` supports defining multiple configurations for the Devi
The following example YAML defines two configurations: "mig" and "ts" (time-slicing).

```yaml
migManager:
  enabled: false
mig:
@ -69,24 +70,15 @@ devicePlugin:
## Applying Configuration to Nodes

Label each Kubernetes node accordingly to activate a specific GPU mode:

| Mode | Label command |
|------|---------------|
| `mig` | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=mig` |
| `ts` (time slicing) | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts` |
| Standard full-GPU access | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled` |

After a node is labeled, the NVIDIA `device-plugin` will automatically reload the new configuration.
## Installing CDMO and CFGI
@ -97,22 +89,26 @@ and [CFGI](cfgi.md).
### Time Slicing

To disable time-slicing and use full GPU access, update the node label using the `--overwrite` flag:
```bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
```
### MIG

To disable MIG mode:

1. Ensure there are no more running workflows requesting any form of GPU on the node(s).
2. Remove the CDMO label from the target node(s).

```bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
```

3. Execute a shell in the `device-plugin-daemonset` pod instance running on the target node(s) and run the following commands:

```bash
nvidia-smi mig -dci
nvidia-smi mig -dgi
@ -120,8 +116,8 @@ To disable MIG mode:
nvidia-smi -mig 0
```

4. Label the node to use standard (non-MIG) GPU mode:

```bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
```

View File

@ -2,34 +2,36 @@
title: ClearML Fractional GPU Injector (CFGI)
---

The **ClearML Enterprise Fractional GPU Injector** (CFGI) allows AI workloads to utilize fractional (non-MIG) GPU slices on Kubernetes clusters, maximizing hardware efficiency and performance.
## Installation

### Add the Local ClearML Helm Repository

```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <GITHUB_TOKEN> --password <GITHUB_TOKEN>
helm repo update
```
### Requirements

* Install the NVIDIA `gpu-operator` using Helm
* Set the number of GPU slices to 8
* Add and update the Nvidia Helm repo:

```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```
* Credentials for the ClearML Enterprise DockerHub repository
### GPU Operator Configuration

#### For CFGI Version >= 1.3.0

1. Create a Docker Registry secret named `clearml-dockerhub-access` in the `gpu-operator` namespace, making sure to replace `<CLEARML_DOCKERHUB_TOKEN>` with your token:
```bash
kubectl create secret -n gpu-operator docker-registry clearml-dockerhub-access \
@ -101,11 +103,11 @@ devicePlugin:
replicas: 8
```
#### For CFGI version < 1.3.0 (Legacy)

Create a `gpu-operator.override.yaml` file:

```yaml
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
@ -144,26 +146,26 @@ devicePlugin:
replicas: 8
```
### Install GPU Operator and CFGI

1. Install the NVIDIA `gpu-operator` using the previously created `gpu-operator.override.yaml` file:

```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```

1. Create a `cfgi-values.override.yaml` file with the following content:

```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
```

1. Install the CFGI Helm Chart using the previous override file:

```bash
helm install -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector --create-namespace -f cfgi-values.override.yaml
```
## Usage
@ -187,9 +189,9 @@ Valid values for `"<GPU_FRACTION_VALUE>"` include:
* "0.875"
* Integer representation of GPUs such as `1.000`, `2`, `2.0`, etc.
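In practice, a workload requests a slice by carrying the `clearml-injector/fraction` label with one of these values. A minimal sketch of such a Pod (the Pod name and image are placeholders, not part of the chart):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fractional-gpu-demo          # placeholder name
  labels:
    # Requests a 0.5 GPU slice from the fractional GPU injector
    clearml-injector/fraction: "0.500"
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
      command: ["nvidia-smi"]
```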
### ClearML Agent Configuration

To run ClearML jobs with fractional GPU allocation, configure your queues accordingly in your `clearml-agent-values.override.yaml` file. Each queue should include a `templateOverrides` entry that sets the `clearml-injector/fraction` label, which determines the fraction of a GPU to allocate (e.g., "0.500" for half a GPU).
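A sketch of such a queue definition (the queue name and fraction value are placeholders, following the `queues`/`templateOverrides` structure shown in the agent installation guide):

```yaml
agentk8sglue:
  createQueues: true
  queues:
    gpu-fraction-half:
      # Tasks pulled from this queue get pods labeled for a 0.5 GPU slice
      templateOverrides:
        labels:
          clearml-injector/fraction: "0.500"
```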
@ -259,16 +261,16 @@ agentk8sglue:
nvidia.com/gpu: 1
```
## Upgrading CFGI Chart

To upgrade to the latest chart version:

```bash
helm repo update
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector
```

To apply new values to an existing installation:

```bash
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector -f cfgi-values.override.yaml