clearml-docs/cdmo.md at ac733ba3f02e7c2eea50d9e8b71809e52dbfc413

mirror of https://github.com/clearml/clearml-docs synced 2025-06-26 18:17:44 +00:00

fbrintazzoli 41e455f46c Added: newline

2025-05-21 10:08:22 +02:00

Managing GPU Fractions with ClearML Dynamic MIG Operator (CDMO)

This guide explains how to use GPU fractions in Kubernetes clusters with NVIDIA Multi-Instance GPU (MIG) and the ClearML Dynamic MIG Operator (CDMO). CDMO enables dynamic MIG configurations.

This guide covers:

Installing CDMO
Enabling MIG mode on your cluster
Managing GPU partitioning dynamically

Installation

Requirements

Add and update the NVIDIA Helm repo:

helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

Create a gpu-operator.override.yaml file with the following content:

migManager:
  enabled: false
mig:
  strategy: mixed
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY # Use volume-mounts
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all

Install the NVIDIA gpu-operator using Helm with the previous configuration:

helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
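After the install, it can help to confirm that the device plugin is running before moving on. The following is a minimal sketch, assuming the `gpu-operator` namespace used above and the `nvidia-device-plugin-daemonset` DaemonSet name mentioned later in this guide. The function name and the 300-second timeout are illustrative choices.

```shell
# Wait for the device plugin DaemonSet to finish rolling out, then list the pods.
# The namespace defaults to the one used in the helm install command above.
# The 300s timeout is an arbitrary choice for this sketch.
wait_for_gpu_operator() {
    ns="${1:-gpu-operator}"
    kubectl -n "$ns" rollout status daemonset/nvidia-device-plugin-daemonset --timeout=300s &&
        kubectl -n "$ns" get pods
}

# Example: wait_for_gpu_operator gpu-operator
```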

Installing CDMO

Create a cdmo-values.override.yaml file with the following content:
```
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
```
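One way to avoid pasting the token into the file by hand is to generate the file from an environment variable. This is a sketch only: the `write_cdmo_values` function and the use of an environment variable named `CLEARML_DOCKERHUB_TOKEN` are assumptions for illustration, not part of CDMO.

```shell
# Write the CDMO override file, reading the token from the environment.
# Assumption: the token is exported as CLEARML_DOCKERHUB_TOKEN beforehand.
write_cdmo_values() {
    out_file="${1:-cdmo-values.override.yaml}"
    if [ -z "${CLEARML_DOCKERHUB_TOKEN:-}" ]; then
        echo "CLEARML_DOCKERHUB_TOKEN is not set" >&2
        return 1
    fi
    # umask runs in a subshell so only this file is affected.
    # It restricts permissions only when the file is newly created.
    (
        umask 077
        cat > "$out_file" <<EOF
imageCredentials:
  password: "${CLEARML_DOCKERHUB_TOKEN}"
EOF
    )
}

# Example: write_cdmo_values cdmo-values.override.yaml
```

A token containing double quotes or backslashes would need escaping before being written into the YAML string.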

Install the CDMO Helm Chart using the previous override file:

helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml

Enable NVIDIA MIG support on your cluster by running the following command on every node with a MIG-capable GPU. Run it once for each GPU <GPU_ID> on the host:
```
nvidia-smi -i <GPU_ID> -mig 1
```
:::note notes
- A node reboot may be required if the command output indicates so.
- For convenience, this command can be run from within the nvidia-device-plugin-daemonset pod running on the related node.
:::
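On hosts with several GPUs, the per-GPU step can be wrapped in a loop. The sketch below uses the standard `nvidia-smi -i` option to select a GPU by index and then queries the current and pending MIG mode. The `enable_mig` function name is hypothetical.

```shell
# Enable MIG mode on each GPU index passed as an argument, then print the
# current and pending MIG mode per GPU. If the pending mode differs from the
# current one, a GPU reset or node reboot is still needed.
enable_mig() {
    for gpu_id in "$@"; do
        nvidia-smi -i "$gpu_id" -mig 1 || return 1
    done
    nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv
}

# Example for a host with two GPUs: enable_mig 0 1
```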

Label all MIG-enabled GPU nodes <NODE_NAME> from the previous step:

kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
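When several nodes need the label, the command can be applied in a loop and then checked with a label selector. This is a minimal sketch. The `label_mig_nodes` function name and the node names in the example are placeholders.

```shell
# Label every node passed as an argument for CDMO-managed MIG partitioning,
# then list the nodes that carry the label.
label_mig_nodes() {
    for node in "$@"; do
        kubectl label nodes "$node" "cdmo.clear.ml/gpu-partitioning=mig" || return 1
    done
    kubectl get nodes -l "cdmo.clear.ml/gpu-partitioning=mig"
}

# Example: label_mig_nodes gpu-node-1 gpu-node-2
```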

Disabling MIGs

To disable MIG mode and restore standard full-GPU access:

Ensure no running workflows are using GPUs on the target node(s).
Remove the CDMO label from the target node(s) to disable dynamic MIG reconfiguration:
```
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
```
Open a shell in the nvidia-device-plugin-daemonset pod running on the target node(s) and run the following commands:
```
# Destroy all compute instances
nvidia-smi mig -dci
# Destroy all GPU instances
nvidia-smi mig -dgi
# Disable MIG mode
nvidia-smi -mig 0
```
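The same three commands can also be issued from outside the pod with `kubectl exec`. The sketch below assumes the `gpu-operator` namespace from the installation step and that the device plugin pods carry the label `app=nvidia-device-plugin-daemonset`. That label is an assumption and should be checked with `kubectl -n gpu-operator get pods --show-labels`. The `disable_mig_on_node` function name is hypothetical.

```shell
# Run the MIG teardown commands in the device plugin pod of a given node.
disable_mig_on_node() {
    node="$1"
    ns="${2:-gpu-operator}"
    # Find the device plugin pod scheduled on the node.
    # Assumption: the pods are labeled app=nvidia-device-plugin-daemonset.
    pod="$(kubectl -n "$ns" get pods -l app=nvidia-device-plugin-daemonset \
        --field-selector "spec.nodeName=$node" -o name | head -n 1)"
    [ -n "$pod" ] || { echo "no device plugin pod found on $node" >&2; return 1; }
    # Order matters: compute instances, then GPU instances, then MIG mode.
    # The destroy commands may report an error when no instances exist,
    # so the steps are deliberately not chained with &&.
    kubectl -n "$ns" exec "$pod" -- nvidia-smi mig -dci
    kubectl -n "$ns" exec "$pod" -- nvidia-smi mig -dgi
    kubectl -n "$ns" exec "$pod" -- nvidia-smi -mig 0
}

# Example: disable_mig_on_node <NODE_NAME>
```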

Edit the gpu-operator.override.yaml file to restore full-GPU access:

toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY # Use volume-mounts
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all

Upgrade the gpu-operator:

helm upgrade -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml
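Once the upgrade has rolled out, the node's allocatable resources give a quick indication of whether full-GPU access is back. With MIG disabled, whole GPUs are normally advertised under the `nvidia.com/gpu` resource name. A minimal sketch, with a hypothetical `show_allocatable` helper:

```shell
# Print the allocatable resources of a node as reported by the Kubernetes API.
show_allocatable() {
    kubectl get node "$1" -o "jsonpath={.status.allocatable}"
}

# Example: show_allocatable <NODE_NAME>
```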
