This commit is contained in:
revital 2025-05-15 14:55:35 +03:00
parent d94d777e55
commit aaa3851de3
10 changed files with 139 additions and 147 deletions

View File

@ -6,8 +6,8 @@ The ClearML Agent enables scheduling and executing distributed experiments on a
## Prerequisites
- A ClearML Enterprise server is up and running.
- Generate a set of `<ACCESS_KEY>` and `<SECRET_KEY>` credentials in the ClearML Server. The easiest way is via
- A running [ClearML Enterprise Server](k8s.md)
- API credentials (`<ACCESS_KEY>` and `<SECRET_KEY>`) generated via
the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
:::note
@ -15,7 +15,7 @@ The ClearML Agent enables scheduling and executing distributed experiments on a
:::
- The worker environment must be able to access the ClearML Server over the same network.
- * Helm token to access `clearml-enterprise` helm-chart repo
- Helm token to access `clearml-enterprise` Helm chart repo
## Installation
@ -36,9 +36,9 @@ helm repo update
Create a `clearml-agent-values.override.yaml` file with the following content:
:::note
Replace the `<ACCESS_KEY>` and `<SECRET_KEY>` with the admin credentials
you created earlier. Set `<api|file|web>ServerUrlReference` to the relevant URLs of your ClearML
Server installation.
Replace the `<ACCESS_KEY>` and `<SECRET_KEY>` with the API credentials you generated earlier.
Set the `<api|file|web>ServerUrlReference` fields to match your ClearML
Server URLs.
:::
```yaml
@ -60,7 +60,7 @@ agentk8sglue:
### Install the Chart
Install the ClearML Enterprise Agent Helm chart using the previous values override file:
Install the ClearML Enterprise Agent Helm chart:
```bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
@ -68,7 +68,7 @@ helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-e
## Additional Configuration Options
To view all configurable options for the Helm chart, run the following command:
To view available configuration options for the Helm chart, run the following command:
```bash
helm show readme clearml-enterprise/clearml-enterprise-agent
@ -76,7 +76,7 @@ helm show readme clearml-enterprise/clearml-enterprise-agent
helm show values clearml-enterprise/clearml-enterprise-agent
```
### Set GPU Availability in Orchestration Dashboard
### Reporting GPU Availability to Orchestration Dashboard
To show GPU availability in the [Orchestration Dashboard](../../webapp/webapp_orchestration_dash.md), explicitly set the number of GPUs:
@ -88,25 +88,22 @@ agentk8sglue:
### Queues
The ClearML Agent monitors ClearML queues and pulls tasks that are scheduled for execution.
The ClearML Agent monitors [ClearML queues](../../fundamentals/agents_and_queues.md) and pulls tasks that are
scheduled for execution.
A single agent can monitor multiple queues. By default, the queues share a base pod template (`agentk8sglue.basePodTemplate`)
used when submitting a task to Kubernetes after it has been extracted from the queue.
A single agent can monitor multiple queues. By default, all queues share a base pod template (`agentk8sglue.basePodTemplate`)
used when launching tasks on Kubernetes after they have been pulled from the queue.
Each queue can be configured with dedicated Pod template spec override (`templateOverrides`). This way queue definitions
can be tailored to different use cases.
Each queue can override the base pod template with its own settings via a `templateOverrides` entry.
This way, queue definitions can be tailored to different use cases.
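For orientation, the following is a minimal sketch of this layout. The queue name, label, and memory values are illustrative only, and the exact field names should be verified with `helm show values`; the examples below elaborate on concrete use cases:

```yaml
agentk8sglue:
  # Defaults applied to every task pod, regardless of queue
  basePodTemplate:
    labels:
      team: red
    resources:
      limits:
        memory: 1Gi
  createQueues: true
  queues:
    # Illustrative queue that overrides only the label; the memory limit is inherited
    blue:
      templateOverrides:
        labels:
          team: blue
```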
The following are a few examples of agent queue templates:
#### Example: GPU Queues
To support GPU queues, you must deploy the NVIDIA GPU Operator on your Kubernetes cluster. For more information, see [GPU Operator](extra_configs/gpu_operator.md).
GPU queue support requires deploying the NVIDIA GPU Operator on your Kubernetes cluster.
For more information, see [GPU Operator](extra_configs/gpu_operator.md).
``` yaml
```yaml
agentk8sglue:
createQueues: true
queues:
@ -122,8 +119,9 @@ agentk8sglue:
nvidia.com/gpu: 2
```
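For reference, here is a minimal sketch of a complete GPU queue definition. The queue name is illustrative, and the exact nesting of `templateOverrides` and `resources` should be verified with `helm show values clearml-enterprise/clearml-enterprise-agent`:

```yaml
agentk8sglue:
  createQueues: true
  queues:
    # Illustrative queue name; tasks pulled from it request two full GPUs
    gpu-queue:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2
```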
#### Example: Overriding Pod Templates per Queue
#### Example: Custom Pod Template per Queue
This example demonstrates how to override the base pod template definitions on a per-queue basis.
In this example:
- The `red` queue inherits both the label `team=red` and the 1Gi memory limit from the `basePodTemplate` section.
@ -167,5 +165,5 @@ agentk8sglue:
## Next Steps
Once the agent is up and running, proceed with deploying the[ ClearML Enterprise App Gateway](appgw_install_k8s.md).
Once the agent is up and running, proceed with deploying the [ClearML Enterprise App Gateway](appgw_install_k8s.md).

View File

@ -2,18 +2,15 @@
title: Dynamically Edit Task Pod Template
---
The ClearML Enterprise Agent supports defining custom Python code to modify a task's Pod template before it is applied
to Kubernetes.
ClearML Agent allows you to inject custom Python code to dynamically modify the Kubernetes Pod template before applying it.
This enables dynamic customization of Task Pod manifests in the context of a ClearML Enterprise Agent, which is useful
for injecting values or changing configurations based on runtime context.
## Agent Configuration
The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` environment variable defines the Python module and function inside that
module that the ClearML Enterprise Agent should invoke before applying a Task Pod template.
module to be invoked by the agent before applying a task pod template.
The Agent will run this code in its own context, pass arguments (including the actual template) to the function, and use
The agent will run this code in its own context, pass arguments (including the actual template) to the function, and use
the returned template to create the final Task Pod in Kubernetes.
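For example, the environment variable can be set through the agent's Helm values. The `extraEnvs` key and the module reference shown below are assumptions; check the chart README for the exact mechanism and value format:

```yaml
agentk8sglue:
  # Assumed key for injecting environment variables into the agent pod
  extraEnvs:
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
      # Hypothetical placeholder; the chart README documents the expected module/function reference
      value: "<CUSTOM_TEMPLATE_MODULE>"
```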
Arguments passed to the function include:
@ -60,13 +57,13 @@ agentk8sglue:
```
:::note notes
* Make sure to include `*args, **kwargs` at the end of the function's argument list and to only use keyword arguments.
* Always include `*args, **kwargs` at the end of the function's argument list and only use keyword arguments.
This is needed to maintain backward compatibility.
* Custom code modules can be included as a file in the pod's container, and the environment variable can be used to
point to the file and entry point.
* When defining a custom code module, by default the Agent will start watching pods in all namespaces
* When defining a custom code module, by default the agent will start watching pods in all namespaces
across the cluster. If you do not intend to give a `ClusterRole` permission, make sure to set the
`CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES` env to `"0"` to prevent the agent from listing pods in all namespaces.
Set it to `"1"` only if namespace-related changes are needed in the code.
@ -80,13 +77,13 @@ agentk8sglue:
To customize the bash startup scripts instead of the pod spec, use:
```yaml
agentk8sglue:
  # -- Custom Bash script for the Agent pod run by the Glue Agent
  customBashScript: ""
  # -- Custom Bash script for the Task Pods run by the Glue Agent
  containerCustomBashScript: ""
```
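For instance, a hypothetical task-pod startup script could look like the following; the script content is purely illustrative:

```yaml
agentk8sglue:
  # Illustrative example: extra setup executed inside every Task Pod before the task starts
  containerCustomBashScript: |
    echo "Preparing task environment"
    export MY_EXTRA_FLAG=1
```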
## Examples
@ -167,7 +164,7 @@ agentk8sglue:
### Example: Bind PVC Resource to Task Pod
In this example, a PVC is created and attached to every Pod created from a dedicated queue, then deleted afterwards.
In this example, a PVC is created and attached to every pod created from a dedicated queue, then it is deleted.
Key points:

View File

@ -2,10 +2,12 @@
title: Basic Deployment - Suggested GPU Operator Values
---
This guide provides recommended configuration values for deploying the NVIDIA GPU Operator alongside ClearML Enterprise.
## Add the Helm Repo Locally
Add the NVIDIA GPU Operator Helm repository:
```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
```
@ -17,10 +19,8 @@ helm repo update
## Installation
To prevent unprivileged containers from bypassing the Kubernetes Device Plugin API, configure the GPU operator with the
following override values.
Create a `gpu-operator.override.yaml` file with the following content:
To prevent unprivileged containers from bypassing the Kubernetes Device Plugin API, configure the GPU operator
using the following `gpu-operator.override.yaml` file:
```yaml
toolkit:
@ -53,7 +53,7 @@ helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace
## Fractional GPU Support
For support with fractional GPUs, refer to the dedicated guides:
To enable fractional GPU allocation or manage mixed GPU configurations, refer to the following guides:
* [ClearML Dynamic MIG Operator](../fractional_gpus/cdmo.md) (CDMO): Dynamically configures MIG GPUs on supported devices.
* [ClearML Enterprise Fractional GPU Injector](../fractional_gpus/cfgi.md) (CFGI): Enables fractional (non-MIG) GPU
allocation for better hardware utilization and workload distribution in Kubernetes.

View File

@ -2,7 +2,8 @@
title: Multi-Node Training
---
The ClearML Enterprise Agent supports horizontal multi-node training--running a single Task across multiple Pods on different nodes.
The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single Task to run across multiple pods
on different nodes.
Below is a configuration example using `clearml-agent-values.override.yaml`:

View File

@ -7,16 +7,16 @@ users, enabling direct access to cloud-hosted data (e.g., S3) without exposing c
## Prerequisites
- A ClearML Enterprise server is up and running.
- Generate `<ACCESS_KEY>` and `<SECRET_KEY>` credentials in the ClearML Server. The easiest way is via the ClearML UI
(**Settings > Workspace > App Credentials > Create new credentials**).
- A ClearML Enterprise Server is up and running.
- API credentials (`<ACCESS_KEY>` and `<SECRET_KEY>`) generated via
the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
:::note
Make sure these credentials belong to an admin user or a service user with admin privileges.
:::
- The worker environment must be able to access the ClearML Server over the same network.
- Token to access `clearml-enterprise` Helm chart repo
## Installation
@ -50,7 +50,7 @@ ingress:
### Deploy the Helm Chart
Install the `clearml-presign-service` helm chart in the same namespace as the ClearML Enterprise server:
Install the `clearml-presign-service` Helm chart in the same namespace as the ClearML Enterprise server:
```bash
helm install -n clearml clearml-presign-service clearml-enterprise/clearml-presign-service -f presign-service.override.yaml

View File

@ -2,10 +2,8 @@
title: ClearML Tenant with Self Signed Certificates
---
This guide covers the configuration to support SSL Custom certificates for the following components:
- ClearML Enterprise AI Application Gateway
- ClearML Enterprise Agent
This guide covers how to configure the [AI Application Gateway](#ai-application-gateway) and [ClearML Agent](#clearml-agent)
to use self-signed or custom SSL certificates.
## AI Application Gateway
@ -25,15 +23,15 @@ customCertificates:
-----END CERTIFICATE-----
```
In this section, there are two options:
You have two configuration options:
- [**Replace**](#replace-the-whole-ca-certificatescrt-file) the entire `ca-certificates.crt` file
- [**Replace**](#replace-entire-ca-certificatescrt-file) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt) extra certificates to the existing `ca-certificates.crt`
### Replace Entire `ca-certificates.crt` File
To replace the whole ca-bundle, you can attach a concatenation of all your valid CA in a `pem` format as
To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format as
they are stored in a standard `ca-certificates.crt`.
```yaml
@ -55,7 +53,7 @@ customCertificates:
### Append Extra Certificates to the Existing `ca-certificates.crt`
You can add certificates to the existing CA bundle. Ensure each certificate has a unique `alias`.
You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.
```yaml
# -- Custom certificates
@ -82,9 +80,9 @@ To apply the changes, run the update command:
helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
```
## ClearML Enterprise Agent
## ClearML Agent
For the Agent, configure certificates in the `clearml-agent-values.override.yaml` file:
For the ClearML Agent, configure certificates in the `clearml-agent-values.override.yaml` file:
```yaml
# -- Custom certificates
@ -100,17 +98,18 @@ customCertificates:
-----END CERTIFICATE-----
```
In the section, there are two options:
You have two configuration options:
- [**Replace**](#replace-the-whole-ca-certificatescrt-file) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt) extra certificates to the existing `ca-certificates.crt`
- [**Replace**](#replace-entire-ca-certificatescrt-file-1) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt-1) extra certificates to the existing `ca-certificates.crt`
### Replace Entire `ca-certificates.crt` File
If you need to replace the whole ca-bundle you can attach a concatenation of all your valid CA in a `pem` format like
To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format as
they are stored in a standard `ca-certificates.crt`.
```yaml
# -- Custom certificates
customCertificates:
@ -130,7 +129,7 @@ customCertificates:
### Append Extra Certificates to the Existing `ca-certificates.crt`
You can add certificates to the existing CA bundle. Ensure each certificate has a unique `alias`.
You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.
```yaml
# -- Custom certificates
@ -151,7 +150,7 @@ customCertificates:
### Add Certificates to Task Pods
If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into Pods:
If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into pods:
```yaml
agentk8sglue:
@ -195,7 +194,7 @@ Their names are usually prefixed with the Helm release name, so adjust according
### Apply Changes
Applying the changes by running the the update command:
Apply the changes by running the update command:
``` bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml

View File

@ -3,7 +3,8 @@ title: SSO (Identity Provider) Setup
---
ClearML Enterprise Server supports various Single Sign-On (SSO) identity providers.
SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and applied to the `apiserver` component.
SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and is applied to the
`apiserver` component.
The following are configuration examples for commonly used providers. Other supported systems include:
* Auth0
@ -11,7 +12,7 @@ The following are configuration examples for commonly used providers. Other supp
* Okta
* Azure AD
* Google
* and AWS Cognito
* AWS Cognito
## Auth0
@ -56,17 +57,17 @@ apiserver:
To map Keycloak groups into the ClearML user's SSO token:
1. Go to the **Client Scopes** tab.
1. Click on the first row `<clearml client>-dedicated`.
1. Click **Add Mapper > By configuration > Group membership**
1. In the dialog:
* select the **Name** "groups"
1. Click on the `<clearml client>-dedicated` scope.
1. Click **Add Mapper > By Configuration > Group Membership**
1. Configure the mapper:
* Select the **Name** "groups"
* Set **Token Claim Name** "groups"
* Uncheck the **Full group path**
* Save the mapper.
To verify:
1. Return to **Client Details > Client scope** tab.
1. Go to the Evaluate sub-tab and select a user who has any group memberships.
1. Go to **Generated ID token** and then to **Generated User Info**.
1Inspect that in both cases you can see the group's claim in the displayed user data.
1. Go to the **Client Details > Client scope** tab.
1. Go to the **Evaluate** sub-tab and select a user with any group memberships.
1. Go to **Generated ID Token** and then to **Generated User Info**.
1. Verify that in both cases the `groups` claim appears in the displayed user data.

View File

@ -2,14 +2,12 @@
title: ClearML Dynamic MIG Operator (CDMO)
---
The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.
The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations.
## Installation
### Requirements
* Install the official NVIDIA `gpu-operator` using Helm with one of the following configurations.
* Add and update the Nvidia Helm repo:
```bash
@ -46,7 +44,7 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.
value: all
```
* Install the official NVIDIA `gpu-operator` using Helm with the previous configuration:
* Install the NVIDIA `gpu-operator` using Helm with the previous configuration:
```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
@ -54,33 +52,33 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.
### Installing CDMO
* Create a `cdmo-values.override.yaml` file with the following content:
1. Create a `cdmo-values.override.yaml` file with the following content:
```yaml
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
```
* Install the CDMO Helm Chart using the previous override file:
1. Install the CDMO Helm Chart using the previous override file:
```bash
helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
```
* Enable the NVIDIA MIG support on your cluster by running the following command on all nodes with a MIG-supported GPU
1. Enable NVIDIA MIG support on your cluster by running the following command on all nodes with a MIG-supported GPU
(run it for each GPU `<GPU_ID>` on the host):
```bash
nvidia-smi -mig 1
```
:::note notes
* The node reboot may be required if the command output indicates so.
:::note notes
* A node reboot may be required if the command output indicates so.
* For convenience, this command can be issued from inside the `nvidia-device-plugin-daemonset` pod running on the related node.
:::
* For convenience, this command can be run from within the `nvidia-device-plugin-daemonset` pod running on the related node.
:::
* Any MIG-enabled GPU node `<NODE_NAME>` from the last point must be labeled accordingly as follows:
1. Label each MIG-enabled GPU node `<NODE_NAME>` from the previous step:
```bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
@ -88,7 +86,7 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.
## Disabling MIGs
To disable MIG, follow these steps:
To disable MIG mode and restore standard full-GPU access:
1. Ensure no running workflows are using GPUs on the target node(s).
@ -108,7 +106,7 @@ To disable MIG, follow these steps:
nvidia-smi -mig 0
```
4. Edit the `gpu-operator.override.yaml` file to have a standard configuration for full GPUs, and upgrade the `gpu-operator`:
4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access, and upgrade the `gpu-operator`:
```yaml
toolkit:

View File

@ -1,7 +1,8 @@
---
title: Install CDMO and CFGI on the same Cluster
title: Install CDMO and CFGI on the Same Cluster
---
You can install both CDMO (ClearML Dynamic MIG Operator) and CFGI (ClearML Fractional GPU Injector) on a shared Kubernetes cluster.
In clusters with multiple nodes and varying GPU types, the `gpu-operator` can be used to manage different device configurations
and fractioning modes.
@ -11,7 +12,7 @@ The NVIDIA `gpu-operator` supports defining multiple configurations for the Devi
The following example YAML defines two configurations: "mig" and "ts" (time-slicing).
``` yaml
```yaml
migManager:
enabled: false
mig:
@ -69,24 +70,15 @@ devicePlugin:
## Applying Configuration to Nodes
To activate a configuration, label the Kubernetes node accordingly. After a node is labeled,
the NVIDIA `device-plugin` will automatically reload the new configuration.
Label each Kubernetes node accordingly to activate a specific GPU mode:
Example usage:
* Apply the `mig` (MIG mode) config:
``` bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=mig
```
|Mode| Label command|
|----|-----|
| `mig` | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=mig` |
| `ts` (time slicing) | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts` |
| Standard full-GPU access | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled` |
* Apply the `ts` (time slicing) config:
``` bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts
```
* Apply the `all-disabled` (standard full GPU access) config:
``` bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled
```
After a node is labeled, the NVIDIA `device-plugin` will automatically reload the new configuration.
## Installing CDMO and CFGI
@ -97,22 +89,26 @@ and [CFGI](cfgi.md).
### Time Slicing
To switch between time-slicing and full GPU access, update the node label using the `--overwrite` flag:
To disable time-slicing and use full GPU access, update the node label using the `--overwrite` flag:
```bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
```
### MIG
To disable MIG mode:
1. Ensure there are no more running workflows requesting any form of GPU on the node(s) before re-configuring it.
2. Remove the CDMO label from the target node(s) to disable the dynamic MIG reconfiguration.
1. Ensure there are no more running workflows requesting any form of GPU on the node(s).
2. Remove the CDMO label from the target node(s).
``` bash
```bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
```
3. Execute a shell in the `device-plugin-daemonset` Pod instance running on the target node(s) and execute the following commands:
3. Open a shell in the `device-plugin-daemonset` pod instance running on the target node(s) and run the following commands:
``` bash
```bash
nvidia-smi mig -dci
nvidia-smi mig -dgi
@ -120,8 +116,8 @@ To disable MIG mode:
nvidia-smi -mig 0
```
4. Relabel the target node to disable MIG:
4. Label the node to use standard (non-MIG) GPU mode:
``` bash
```bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
```

View File

@ -2,34 +2,36 @@
title: ClearML Fractional GPU Injector (CFGI)
---
The **ClearML Enterprise Fractional GPU Injector** (CFGI) allows AI workloads to run on Kubernetes using non-MIG GPU
fractions, optimizing both hardware utilization and performance.
The **ClearML Enterprise Fractional GPU Injector** (CFGI) allows AI workloads to utilize fractional (non-MIG) GPU slices
on Kubernetes clusters, maximizing hardware efficiency and performance.
## Installation
### Add the Local ClearML Helm Repository
``` bash
```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <GITHUB_TOKEN> --password <GITHUB_TOKEN>
helm repo update
```
### Requirements
* Install the official NVIDIA `gpu-operator` using Helm with one of the following configurations.
* The number of slices must be 8.
* Install the NVIDIA `gpu-operator` using Helm
* Set the number of GPU slices to 8
* Add and update the Nvidia Helm repo:
``` bash
```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```
#### GPU Operator Configuration
* Credentials for the ClearML Enterprise DockerHub repository
##### For CFGI Version >= 1.3.0
### GPU Operator Configuration
1. Create a docker-registry secret named `clearml-dockerhub-access` in the `gpu-operator` Namespace, making sure to replace your `<CLEARML_DOCKERHUB_TOKEN>`:
#### For CFGI Version >= 1.3.0
1. Create a Docker Registry secret named `clearml-dockerhub-access` in the `gpu-operator` namespace. Make sure to replace `<CLEARML_DOCKERHUB_TOKEN>` with your token.
```bash
kubectl create secret -n gpu-operator docker-registry clearml-dockerhub-access \
@ -101,11 +103,11 @@ devicePlugin:
replicas: 8
```
##### For CFGI version < 1.3.0 (Legacy GPU Operator)
#### For CFGI version < 1.3.0 (Legacy)
Create a `gpu-operator.override.yaml` file:
``` yaml
```yaml
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
@ -144,26 +146,26 @@ devicePlugin:
replicas: 8
```
### Install
### Install GPU Operator and CFGI
Install the nvidia `gpu-operator` using the previously created `gpu-operator.override.yaml` override file:
1. Install the NVIDIA `gpu-operator` using the previously created `gpu-operator.override.yaml` file:
```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```
```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```
Create a `cfgi-values.override.yaml` file with the following content:
1. Create a `cfgi-values.override.yaml` file with the following content:
```yaml
imageCredentials:
```yaml
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
```
```
Install the CFGI Helm Chart using the previous override file:
1. Install the CFGI Helm Chart using the previous override file:
```bash
helm install -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector --create-namespace -f cfgi-values.override.yaml
```
```bash
helm install -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector --create-namespace -f cfgi-values.override.yaml
```
## Usage
@ -187,9 +189,9 @@ Valid values for `"<GPU_FRACTION_VALUE>"` include:
* "0.875"
* Integer representation of GPUs such as `1.000`, `2`, `2.0`, etc.
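As a sketch, and assuming the injector acts on any pod carrying the label, a workload can request a GPU fraction by setting `clearml-injector/fraction` in its pod metadata; the pod name and image below are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fraction-demo                 # illustrative name
  labels:
    # Request half of a single GPU via the injector label
    clearml-injector/fraction: "0.500"
spec:
  containers:
    - name: workload
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
```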
### ClearML Enterprise Agent Configuration
### ClearML Agent Configuration
To run ClearML jobs that request specific GPU fractions, configure the queues in your `clearml-agent-values.override.yaml` file.
To run ClearML jobs with fractional GPU allocation, configure the queues accordingly in your `clearml-agent-values.override.yaml` file.
Each queue should include a `templateOverride` that sets the `clearml-injector/fraction` label, which determines the
fraction of a GPU to allocate (e.g., "0.500" for half a GPU).
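A minimal sketch of such a queue definition, assuming the same `templateOverrides` nesting used for queue definitions above (the queue name is illustrative):

```yaml
agentk8sglue:
  createQueues: true
  queues:
    # Illustrative queue that allocates half a GPU to each task it runs
    gpu-half:
      templateOverrides:
        labels:
          clearml-injector/fraction: "0.500"
```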
@ -259,16 +261,16 @@ agentk8sglue:
nvidia.com/gpu: 1
```
## Upgrading Chart
## Upgrading CFGI Chart
To upgrade to the latest version of this chart:
To upgrade to the latest chart version:
```bash
helm repo update
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector
```
To apply changes to values on an existing installation:
To apply new values to an existing installation:
```bash
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector -f cfgi-values.override.yaml