mirror of https://github.com/clearml/clearml-docs
synced 2025-06-10 00:26:30 +00:00

commit aaa3851de3 — "Edits"
parent d94d777e55
@@ -6,8 +6,8 @@ The ClearML Agent enables scheduling and executing distributed experiments on a
 
 ## Prerequisites
 
-- A ClearML Enterprise server is up and running.
-- Generate a set of `<ACCESS_KEY>` and `<SECRET_KEY>` credentials in the ClearML Server. The easiest way is via
+- A running [ClearML Enterprise Server](k8s.md)
+- API credentials (`<ACCESS_KEY>` and `<SECRET_KEY>`) generated via
   the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
 
 :::note
@@ -15,7 +15,7 @@ The ClearML Agent enables scheduling and executing distributed experiments on a
 :::
 
 - The worker environment must be able to access the ClearML Server over the same network.
-* Helm token to access `clearml-enterprise` helm-chart repo
+- Helm token to access `clearml-enterprise` Helm chart repo
 
 ## Installation
 
@@ -36,9 +36,9 @@ helm repo update
 Create a `clearml-agent-values.override.yaml` file with the following content:
 
 :::note
-Replace the `<ACCESS_KEY>` and `<SECRET_KEY>` with the admin credentials
-you created earlier. Set `<api|file|web>ServerUrlReference` to the relevant URLs of your ClearML
-Server installation.
+Replace the `<ACCESS_KEY>` and `<SECRET_KEY>` with the API credentials you generated earlier.
+Set the `<api|file|web>ServerUrlReference` fields to match your ClearML
+Server URLs.
 :::
 
 ```yaml
@@ -60,7 +60,7 @@ agentk8sglue:
 
 ### Install the Chart
 
-Install the ClearML Enterprise Agent Helm chart using the previous values override file:
+Install the ClearML Enterprise Agent Helm chart:
 
 ```bash
 helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
@@ -68,7 +68,7 @@ helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-e
 
 ## Additional Configuration Options
 
-To view all configurable options for the Helm chart, run the following command:
+To view available configuration options for the Helm chart, run the following command:
 
 ```bash
 helm show readme clearml-enterprise/clearml-enterprise-agent
@@ -76,7 +76,7 @@ helm show readme clearml-enterprise/clearml-enterprise-agent
 helm show values clearml-enterprise/clearml-enterprise-agent
 ```
 
-### Set GPU Availability in Orchestration Dashboard
+### Reporting GPU Availability to Orchestration Dashboard
 
 To show GPU availability in the [Orchestration Dashboard](../../webapp/webapp_orchestration_dash.md), explicitly set the number of GPUs:
 
@@ -88,25 +88,22 @@ agentk8sglue:
 
 ### Queues
 
-The ClearML Agent monitors ClearML queues and pulls tasks that are scheduled for execution.
+The ClearML Agent monitors [ClearML queues](../../fundamentals/agents_and_queues.md) and pulls tasks that are
+scheduled for execution.
 
-A single agent can monitor multiple queues. By default, the queues share a base pod template (`agentk8sglue.basePodTemplate`)
-used when submitting a task to Kubernetes after it has been extracted from the queue.
+A single agent can monitor multiple queues. By default, all queues share a base pod template (`agentk8sglue.basePodTemplate`)
+used when launching tasks on Kubernetes after they have been pulled from the queue.
 
-Each queue can be configured with a dedicated Pod template spec override (`templateOverrides`). This way queue definitions
-can be tailored to different use cases.
+Each queue can be configured to override the base pod template with its own settings via `templateOverrides`.
+This way queue definitions can be tailored to different use cases.
 
 The following are a few examples of agent queue templates:
 
 #### Example: GPU Queues
 
-To support GPU queues, you must deploy the NVIDIA GPU Operator on your Kubernetes cluster. For more information, see [GPU Operator](extra_configs/gpu_operator.md).
+GPU queue support requires deploying the NVIDIA GPU Operator on your Kubernetes cluster.
+
+For more information, see [GPU Operator](extra_configs/gpu_operator.md).
 
-``` yaml
+```yaml
 agentk8sglue:
   createQueues: true
   queues:
@@ -122,8 +119,9 @@ agentk8sglue:
           nvidia.com/gpu: 2
 ```
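The middle of the queue block above is elided by the hunk boundary. Purely as an illustration (the queue name and the exact nesting are assumptions; only the keys and the `nvidia.com/gpu: 2` limit visible in the hunk come from the docs), a GPU queue definition could look like:

```yaml
# Illustrative sketch only -- "gpu-queue" and the nesting are assumptions.
agentk8sglue:
  createQueues: true
  queues:
    gpu-queue:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2
```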
 
-#### Example: Overriding Pod Templates per Queue
+#### Example: Custom Pod Template per Queue
 
 This example demonstrates how to override the base pod template definitions on a per-queue basis.
+In this example:
 
 - The `red` queue inherits both the label `team=red` and the 1Gi memory limit from the `basePodTemplate` section.
@@ -167,5 +165,5 @@ agentk8sglue:
 
 ## Next Steps
 
-Once the agent is up and running, proceed with deploying the[ ClearML Enterprise App Gateway](appgw_install_k8s.md).
+Once the agent is up and running, proceed with deploying the [ClearML Enterprise App Gateway](appgw_install_k8s.md).
@@ -2,18 +2,15 @@
 title: Dynamically Edit Task Pod Template
 ---
 
-The ClearML Enterprise Agent supports defining custom Python code to modify a task's Pod template before it is applied
-to Kubernetes.
-
-This enables dynamic customization of Task Pod manifests in the context of a ClearML Enterprise Agent, which is useful
-for injecting values or changing configurations based on runtime context.
+ClearML Agent allows you to inject custom Python code to dynamically modify the Kubernetes Pod template before applying it.
 
 ## Agent Configuration
 
 The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` environment variable defines the Python module and function inside that
-module that the ClearML Enterprise Agent should invoke before applying a Task Pod template.
+module to be invoked by the agent before applying a task Pod template.
 
-The Agent will run this code in its own context, pass arguments (including the actual template) to the function, and use
+The agent will run this code in its own context, pass arguments (including the actual template) to the function, and use
 the returned template to create the final Task Pod in Kubernetes.
 
 Arguments passed to the function include:
@@ -60,13 +57,13 @@ agentk8sglue:
 ```
 
 :::note notes
-* Make sure to include `*args, **kwargs` at the end of the function's argument list and to only use keyword arguments.
+* Always include `*args, **kwargs` at the end of the function's argument list and only use keyword arguments.
   This is needed to maintain backward compatibility.
 
 * Custom code modules can be included as a file in the pod's container, and the environment variable can be used to
   point to the file and entry point.
 
-* When defining a custom code module, by default the Agent will start watching pods in all namespaces
+* When defining a custom code module, by default the agent will start watching pods in all namespaces
   across the cluster. If you do not intend to give a `ClusterRole` permission, make sure to set the
   `CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES` env to `"0"` to prevent the agent from trying to list pods in all namespaces.
   Instead, set it to `"1"` if namespace-related changes are needed in the code.
@@ -80,13 +77,13 @@ agentk8sglue:
 
 To customize the bash startup scripts instead of the pod spec, use:
 
 ```yaml
 agentk8sglue:
   # -- Custom Bash script for the Agent pod run by the Glue Agent
   customBashScript: ""
   # -- Custom Bash script for the Task Pods run by the Glue Agent
   containerCustomBashScript: ""
 ```
 
 ## Examples
 
@@ -167,7 +164,7 @@ agentk8sglue:
 
 ### Example: Bind PVC Resource to Task Pod
 
-In this example, a PVC is created and attached to every Pod created from a dedicated queue, then deleted afterwards.
+In this example, a PVC is created and attached to every pod created from a dedicated queue, and deleted afterwards.
 
 Key points:
@@ -2,10 +2,12 @@
 title: Basic Deployment - Suggested GPU Operator Values
 ---
 
+This guide provides recommended configuration values for deploying the NVIDIA GPU Operator alongside ClearML Enterprise.
+
 ## Add the Helm Repo Locally
 
+Add the NVIDIA GPU Operator Helm repository:
+
 ```bash
 helm repo add nvidia https://nvidia.github.io/gpu-operator
 ```
@@ -17,10 +19,8 @@ helm repo update
 
 ## Installation
 
-To prevent unprivileged containers from bypassing the Kubernetes Device Plugin API, configure the GPU operator with the
-following override values.
-
-Create a `gpu-operator.override.yaml` file with the following content:
+To prevent unprivileged containers from bypassing the Kubernetes Device Plugin API, configure the GPU operator
+using the following `gpu-operator.override.yaml` file:
 
 ```yaml
 toolkit:
@@ -53,7 +53,7 @@ helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace
 
 ## Fractional GPU Support
 
-For support with fractional GPUs, refer to the dedicated guides:
+To enable fractional GPU allocation or manage mixed GPU configurations, refer to the following guides:
 * [ClearML Dynamic MIG Operator](../fractional_gpus/cdmo.md) (CDMO) – Dynamically configures MIG GPUs on supported devices.
 * [ClearML Enterprise Fractional GPU Injector](../fractional_gpus/cfgi.md) (CFGI) – Enables fractional (non-MIG) GPU
   allocation for better hardware utilization and workload distribution in Kubernetes.
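The override file's contents are truncated by the hunk above (it opens with `toolkit:`). Based on the env var that appears in the CFGI legacy configuration elsewhere in this commit, it presumably contains something along these lines — the value and the completeness of this fragment are assumptions:

```yaml
# Illustrative sketch -- the env var name appears elsewhere in these docs;
# its value and the full override contents here are assumptions.
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
```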
@@ -2,7 +2,8 @@
 title: Multi-Node Training
 ---
 
-The ClearML Enterprise Agent supports horizontal multi-node training--running a single Task across multiple Pods on different nodes.
+The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single Task to run across multiple pods
+on different nodes.
 
 Below is a configuration example using `clearml-agent-values.override.yaml`:
@@ -7,16 +7,16 @@ users, enabling direct access to cloud-hosted data (e.g., S3) without exposing c
 
 ## Prerequisites
 
-- A ClearML Enterprise server is up and running.
-- Generate `<ACCESS_KEY>` and `<SECRET_KEY>` credentials in the ClearML Server. The easiest way is via the ClearML UI
-  (**Settings > Workspace > App Credentials > Create new credentials**).
+- A ClearML Enterprise Server is up and running.
+- API credentials (`<ACCESS_KEY>` and `<SECRET_KEY>`) generated via
+  the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
 
 :::note
 Make sure these credentials belong to an admin user or a service user with admin privileges.
 :::
 
 - The worker environment must be able to access the ClearML Server over the same network.
 
 - Token to access `clearml-enterprise` Helm chart repo
 
 ## Installation
 
@@ -50,7 +50,7 @@ ingress:
 
 ### Deploy the Helm Chart
 
-Install the `clearml-presign-service` helm chart in the same namespace as the ClearML Enterprise server:
+Install the `clearml-presign-service` Helm chart in the same namespace as the ClearML Enterprise server:
 
 ```bash
 helm install -n clearml clearml-presign-service clearml-enterprise/clearml-presign-service -f presign-service.override.yaml
@@ -2,10 +2,8 @@
 title: ClearML Tenant with Self Signed Certificates
 ---
 
-This guide covers the configuration to support SSL Custom certificates for the following components:
-
-- ClearML Enterprise AI Application Gateway
-- ClearML Enterprise Agent
+This guide covers how to configure the [AI Application Gateway](#ai-application-gateway) and the [ClearML Agent](#clearml-agent)
+to use self-signed or custom SSL certificates.
 
 ## AI Application Gateway
 
@@ -25,15 +23,15 @@ customCertificates:
   -----END CERTIFICATE-----
 ```
 
-In this section, there are two options:
+You have two configuration options:
 
-- [**Replace**](#replace-the-whole-ca-certificatescrt-file) the entire `ca-certificates.crt` file
+- [**Replace**](#replace-entire-ca-certificatescrt-file) the entire `ca-certificates.crt` file
 - [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt) extra certificates to the existing `ca-certificates.crt`
 
 ### Replace Entire `ca-certificates.crt` File
 
-To replace the whole ca-bundle, you can attach a concatenation of all your valid CAs in `pem` format as
+To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format, as
 they are stored in a standard `ca-certificates.crt`.
 
 ```yaml
@@ -55,7 +53,7 @@ customCertificates:
 
 ### Append Extra Certificates to the Existing `ca-certificates.crt`
 
-You can add certificates to the existing CA bundle. Ensure each certificate has a unique `alias`.
+You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.
 
 ```yaml
 # -- Custom certificates
@@ -82,9 +80,9 @@ To apply the changes, run the update command:
 helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
 ```
 
-## ClearML Enterprise Agent
+## ClearML Agent
 
-For the Agent, configure certificates in the `clearml-agent-values.override.yaml` file:
+For the ClearML Agent, configure certificates in the `clearml-agent-values.override.yaml` file:
 
 ```yaml
 # -- Custom certificates
@@ -100,17 +98,18 @@ customCertificates:
   -----END CERTIFICATE-----
 ```
 
-In the section, there are two options:
+You have two configuration options:
 
-- [**Replace**](#replace-the-whole-ca-certificatescrt-file) the entire `ca-certificates.crt` file
-- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt) extra certificates to the existing `ca-certificates.crt`
+- [**Replace**](#replace-entire-ca-certificatescrt-file-1) the entire `ca-certificates.crt` file
+- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt-1) extra certificates to the existing `ca-certificates.crt`
 
 ### Replace Entire `ca-certificates.crt` File
 
-If you need to replace the whole ca-bundle you can attach a concatenation of all your valid CAs in `pem` format like
+To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format, as
 they are stored in a standard `ca-certificates.crt`.
 
 ```yaml
 # -- Custom certificates
 customCertificates:
@@ -130,7 +129,7 @@ customCertificates:
 
 ### Append Extra Certificates to the Existing `ca-certificates.crt`
 
-You can add certificates to the existing CA bundle. Ensure each certificate has a unique `alias`.
+You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.
 
 ```yaml
 # -- Custom certificates
@@ -151,7 +150,7 @@ customCertificates:
 
 ### Add Certificates to Task Pods
 
-If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into Pods:
+If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into pods:
 
 ```yaml
 agentk8sglue:
@@ -195,7 +194,7 @@ Their names are usually prefixed with the Helm release name, so adjust according
 
 ### Apply Changes
 
-Applying the changes by running the the update command:
+Apply the changes by running the update command:
 
 ```bash
 helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
@@ -3,7 +3,8 @@ title: SSO (Identity Provider) Setup
 ---
 
 ClearML Enterprise Server supports various Single Sign-On (SSO) identity providers.
-SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and applied to the `apiserver` component.
+SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and is applied to the
+`apiserver` component.
 
 The following are configuration examples for commonly used providers. Other supported systems include:
 * Auth0
@@ -11,7 +12,7 @@ The following are configuration examples for commonly used providers. Other supp
 * Okta
 * Azure AD
 * Google
-* and AWS Cognito
+* AWS Cognito
 
 ## Auth0
 
@@ -56,17 +57,17 @@ apiserver:
 To map Keycloak groups into the ClearML user's SSO token:
 
 1. Go to the **Client Scopes** tab.
-1. Click on the first row `<clearml client>-dedicated`.
-1. Click **Add Mapper > By configuration > Group membership**
-1. In the dialog:
-   * select the **Name** "groups"
+1. Click on the `<clearml client>-dedicated` scope.
+1. Click **Add Mapper > By Configuration > Group Membership**.
+1. Configure the mapper:
+   * Select the **Name** "groups"
    * Set **Token Claim Name** "groups"
    * Uncheck the **Full group path**
   * Save the mapper.
 
 To verify:
 
-1. Return to **Client Details > Client scope** tab.
-1. Go to the Evaluate sub-tab and select a user who has any group memberships.
-1. Go to **Generated ID token** and then to **Generated User Info**.
-1Inspect that in both cases you can see the group's claim in the displayed user data.
+1. Go to the **Client Details > Client scope** tab.
+1. Go to the **Evaluate** sub-tab and select a user with any group memberships.
+1. Go to **Generated ID Token** and then to **Generated User Info**.
+1. Inspect that in both cases you can see the group's claim in the displayed user data.
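With the mapper above in place and **Full group path** unchecked, the `groups` claim in the generated token contains plain group names rather than full paths such as `/parent/child`. An illustrative claim (group names are made up):

```json
{
  "groups": ["mlops-team", "data-science"]
}
```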
|
@ -2,14 +2,12 @@
|
||||
title: ClearML Dynamic MIG Operator (CDMO)
|
||||
---
|
||||
|
||||
The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.
|
||||
The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations.
|
||||
|
||||
## Installation
|
||||
|
||||
### Requirements
|
||||
|
||||
* Install the official NVIDIA `gpu-operator` using Helm with one of the following configurations.
|
||||
|
||||
* Add and update the Nvidia Helm repo:
|
||||
|
||||
```bash
|
||||
@ -46,7 +44,7 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.
|
||||
value: all
|
||||
```
|
||||
|
||||
* Install the official NVIDIA `gpu-operator` using Helm with the previous configuration:
|
||||
* Install the NVIDIA `gpu-operator` using Helm with the previous configuration:
|
||||
|
||||
```bash
|
||||
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
|
||||
@ -54,33 +52,33 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.
|
||||
|
||||
### Installing CDMO
|
||||
|
||||
* Create a `cdmo-values.override.yaml` file with the following content:
|
||||
1. Create a `cdmo-values.override.yaml` file with the following content:
|
||||
|
||||
```yaml
|
||||
imageCredentials:
|
||||
password: "<CLEARML_DOCKERHUB_TOKEN>"
|
||||
```
|
||||
|
||||
* Install the CDMO Helm Chart using the previous override file:
|
||||
1. Install the CDMO Helm Chart using the previous override file:
|
||||
|
||||
```bash
|
||||
helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
|
||||
```
|
||||
|
||||
* Enable the NVIDIA MIG support on your cluster by running the following command on all nodes with a MIG-supported GPU
|
||||
1. Enable the NVIDIA MIG support on your cluster by running the following command on all nodes with a MIG-supported GPU
|
||||
(run it for each GPU `<GPU_ID>` on the host):
|
||||
|
||||
```bash
|
||||
nvidia-smi -mig 1
|
||||
```
|
||||
|
||||
:::note notes
|
||||
* The node reboot may be required if the command output indicates so.
|
||||
:::note notes
|
||||
* A node reboot may be required if the command output indicates so.
|
||||
|
||||
* For convenience, this command can be run from within the `nvidia-device-plugin-daemonset` pod running on the related node.
|
||||
:::
|
||||
|
||||
* For convenience, this command can be issued from inside the `nvidia-device-plugin-daemonset` pod running on the related node.
|
||||
:::
|
||||
|
||||
* Any MIG-enabled GPU node `<NODE_NAME>` from the last point must be labeled accordingly as follows:
|
||||
1. Label all MIG-enabled GPU node `<NODE_NAME>` from the previous step:
|
||||
|
||||
```bash
|
||||
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
|
||||
@ -88,7 +86,7 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.
|
||||
|
||||
## Disabling MIGs
|
||||
|
||||
To disable MIG, follow these steps:
|
||||
To disable MIG mode and restore standard full-GPU access:
|
||||
|
||||
1. Ensure no running workflows are using GPUs on the target node(s).
|
||||
|
||||
@ -108,7 +106,7 @@ To disable MIG, follow these steps:
|
||||
nvidia-smi -mig 0
|
||||
```
|
||||
|
||||
4. Edit the `gpu-operator.override.yaml` file to have a standard configuration for full GPUs, and upgrade the `gpu-operator`:
|
||||
4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access, and upgrade the `gpu-operator`:
|
||||
|
||||
```yaml
|
||||
toolkit:
|
||||
|
@@ -1,7 +1,8 @@
 ---
-title: Install CDMO and CFGI on the same Cluster
+title: Install CDMO and CFGI on the Same Cluster
 ---
 
+You can install both CDMO (ClearML Dynamic MIG Operator) and CFGI (ClearML Fractional GPU Injector) on a shared Kubernetes cluster.
 In clusters with multiple nodes and varying GPU types, the `gpu-operator` can be used to manage different device configurations
 and fractioning modes.
 
@@ -11,7 +12,7 @@ The NVIDIA `gpu-operator` supports defining multiple configurations for the Devi
 
 The following example YAML defines two configurations: "mig" and "ts" (time-slicing).
 
-``` yaml
+```yaml
 migManager:
   enabled: false
 mig:
@@ -69,24 +70,15 @@ devicePlugin:
 
 ## Applying Configuration to Nodes
 
-To activate a configuration, label the Kubernetes node accordingly. After a node is labeled,
-the NVIDIA `device-plugin` will automatically reload the new configuration.
-
-Example usage:
-* Apply the `mig` (MIG mode) config:
-  ``` bash
-  kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=mig
-  ```
-
-* Apply the `ts` (time slicing) config:
-  ``` bash
-  kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts
-  ```
-
-* Apply the `all-disabled` (standard full GPU access) config:
-  ``` bash
-  kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled
-  ```
+Label each Kubernetes node accordingly to activate a specific GPU mode:
+
+| Mode | Label command |
+|------|---------------|
+| `mig` | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=mig` |
+| `ts` (time slicing) | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts` |
+| Standard full-GPU access | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled` |
+
+After a node is labeled, the NVIDIA `device-plugin` will automatically reload the new configuration.
 
 ## Installing CDMO and CFGI
 
@@ -97,22 +89,26 @@ and [CFGI](cfgi.md).
 
 ### Time Slicing
 
-To switch between time-slicing and full GPU access, update the node label using the `--overwrite` flag:
+To disable time-slicing and use full GPU access, update the node label using the `--overwrite` flag:
 
 ```bash
 kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
 ```
 
 ### MIG
 
 To disable MIG mode:
 
-1. Ensure there are no more running workflows requesting any form of GPU on the node(s) before re-configuring it.
-2. Remove the CDMO label from the target node(s) to disable the dynamic MIG reconfiguration.
+1. Ensure there are no more running workflows requesting any form of GPU on the node(s).
+2. Remove the CDMO label from the target node(s).
 
-   ``` bash
+   ```bash
   kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
   ```
 
-3. Execute a shell in the `device-plugin-daemonset` Pod instance running on the target node(s) and execute the following commands:
+3. Open a shell in the `device-plugin-daemonset` pod instance running on the target node(s) and execute the following commands:
 
-   ``` bash
+   ```bash
   nvidia-smi mig -dci
 
   nvidia-smi mig -dgi
@@ -120,8 +116,8 @@ To disable MIG mode:
   nvidia-smi -mig 0
   ```
 
-4. Relabel the target node to disable MIG:
+4. Label the node to use standard (non-MIG) GPU mode:
 
-   ``` bash
+   ```bash
   kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
   ```
@@ -2,34 +2,36 @@
 title: ClearML Fractional GPU Injector (CFGI)
 ---
 
-The **ClearML Enterprise Fractional GPU Injector** (CFGI) allows AI workloads to run on Kubernetes using non-MIG GPU
-fractions, optimizing both hardware utilization and performance.
+The **ClearML Enterprise Fractional GPU Injector** (CFGI) allows AI workloads to utilize fractional (non-MIG) GPU slices
+on Kubernetes clusters, maximizing hardware efficiency and performance.
 
 ## Installation
 
 ### Add the Local ClearML Helm Repository
 
-``` bash
+```bash
 helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <GITHUB_TOKEN> --password <GITHUB_TOKEN>
 helm repo update
 ```
 
 ### Requirements
 
-* Install the official NVIDIA `gpu-operator` using Helm with one of the following configurations.
-* The number of slices must be 8.
+* Install the NVIDIA `gpu-operator` using Helm
+* Set the number of GPU slices to 8
 * Add and update the Nvidia Helm repo:
 
-  ``` bash
+  ```bash
   helm repo add nvidia https://nvidia.github.io/gpu-operator
   helm repo update
   ```
 
+* Credentials for the ClearML Enterprise DockerHub repository
 
-#### GPU Operator Configuration
+### GPU Operator Configuration
 
-##### For CFGI Version >= 1.3.0
+#### For CFGI Version >= 1.3.0
 
-1. Create a docker-registry secret named `clearml-dockerhub-access` in the `gpu-operator` Namespace, making sure to replace your `<CLEARML_DOCKERHUB_TOKEN>`:
+1. Create a Docker Registry secret named `clearml-dockerhub-access` in the `gpu-operator` namespace. Make sure to replace `<CLEARML_DOCKERHUB_TOKEN>` with your token.
 
   ```bash
  kubectl create secret -n gpu-operator docker-registry clearml-dockerhub-access \
@@ -101,11 +103,11 @@ devicePlugin:
     replicas: 8
 ```
 
-##### For CFGI version < 1.3.0 (Legacy GPU Operator)
+#### For CFGI version < 1.3.0 (Legacy)
 
 Create a `gpu-operator.override.yaml` file:
 
-``` yaml
+```yaml
 toolkit:
   env:
     - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
@@ -144,26 +146,26 @@ devicePlugin:
     replicas: 8
 ```
 
-### Install
+### Install GPU Operator and CFGI
 
-Install the nvidia `gpu-operator` using the previously created `gpu-operator.override.yaml` override file:
+1. Install the NVIDIA `gpu-operator` using the previously created `gpu-operator.override.yaml` file:
 
-```bash
-helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
-```
+   ```bash
+   helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
+   ```
 
-Create a `cfgi-values.override.yaml` file with the following content:
+1. Create a `cfgi-values.override.yaml` file with the following content:
 
-```yaml
-imageCredentials:
-  password: "<CLEARML_DOCKERHUB_TOKEN>"
-```
+   ```yaml
+   imageCredentials:
+     password: "<CLEARML_DOCKERHUB_TOKEN>"
+   ```
 
-Install the CFGI Helm Chart using the previous override file:
+1. Install the CFGI Helm Chart using the previous override file:
 
-```bash
-helm install -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector --create-namespace -f cfgi-values.override.yaml
-```
+   ```bash
+   helm install -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector --create-namespace -f cfgi-values.override.yaml
+   ```
 
 ## Usage
 
@@ -187,9 +189,9 @@ Valid values for `"<GPU_FRACTION_VALUE>"` include:
 * "0.875"
 * Integer representation of GPUs such as `1.000`, `2`, `2.0`, etc.
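The Usage example preceding the valid-values list is elided by the hunk above. As an illustration only, a Task Pod requesting half a GPU would carry the `clearml-injector/fraction` label that appears in this file — the surrounding metadata structure is an assumption:

```yaml
# Illustrative Pod metadata only -- the label key and "0.500" value come from
# these docs; the placement under metadata.labels is an assumption.
metadata:
  labels:
    clearml-injector/fraction: "0.500"
```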
-### ClearML Enterprise Agent Configuration
+### ClearML Agent Configuration
 
-To run ClearML jobs that request specific GPU fractions, configure the queues in your `clearml-agent-values.override.yaml` file.
+To run ClearML jobs with fractional GPU allocation, configure your queues accordingly in your `clearml-agent-values.override.yaml` file.
 
 Each queue should include a `templateOverride` that sets the `clearml-injector/fraction` label, which determines the
 fraction of a GPU to allocate (e.g., "0.500" for half a GPU).
@@ -259,16 +261,16 @@ agentk8sglue:
           nvidia.com/gpu: 1
 ```
 
-## Upgrading Chart
+## Upgrading CFGI Chart
 
-To upgrade to the latest version of this chart:
+To upgrade to the latest chart version:
 
 ```bash
 helm repo update
 helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector
 ```
 
-To apply changes to values on an existing installation:
+To apply new values to an existing installation:
 
 ```bash
 helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector -f cfgi-values.override.yaml
 ```