
---
title: ClearML Dynamic MIG Operator (CDMO)
---

The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations.
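Once partitioning is in place, a workload can request a specific MIG slice through standard Kubernetes resource limits. The following is a minimal illustrative pod spec, assuming the `mixed` MIG strategy configured below; the exact resource name (here `nvidia.com/mig-1g.5gb`) depends on the GPU model and the active partitioning profile:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      resources:
        limits:
          # Illustrative MIG resource name; varies with GPU model and profile
          nvidia.com/mig-1g.5gb: 1
```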

Installation

Requirements

  • Add the NVIDIA Helm repo and update it:

    helm repo add nvidia https://nvidia.github.io/gpu-operator
    helm repo update
    
  • Create a gpu-operator.override.yaml file with the following content:

    migManager:
      enabled: false
    mig:
      strategy: mixed
    toolkit:
      env:
        - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
          value: "false"
        - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
          value: "true"
    devicePlugin:
      env:
        - name: PASS_DEVICE_SPECS
          value: "true"
        - name: FAIL_ON_INIT_ERROR
          value: "true"
        - name: DEVICE_LIST_STRATEGY # Use volume-mounts
          value: volume-mounts
        - name: DEVICE_ID_STRATEGY
          value: uuid
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
    
  • Install the NVIDIA gpu-operator using Helm with the previous configuration:

    helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
    
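Before proceeding, you can verify that the operator pods came up cleanly (the namespace matches the install command above; pod names will vary by cluster):

```shell
# List the gpu-operator pods; all should reach Running or Completed status
kubectl -n gpu-operator get pods
```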

Installing CDMO

  1. Create a cdmo-values.override.yaml file with the following content:

    imageCredentials:
      password: "<CLEARML_DOCKERHUB_TOKEN>"
    
  2. Install the CDMO Helm Chart using the previous override file:

    helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
    
  3. Enable NVIDIA MIG support on your cluster by running the following command on every node with a MIG-capable GPU, once for each GPU <GPU_ID> on the host:

    nvidia-smi -i <GPU_ID> -mig 1
    

    :::note notes

    • A node reboot may be required if the command output indicates so.

    • For convenience, this command can be run from within the nvidia-device-plugin-daemonset pod running on the related node.

    :::

  4. Label each MIG-enabled GPU node <NODE_NAME> from the previous step:

    kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
    
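To confirm the setup, you can check that the label was applied and that MIG mode is reported as enabled (illustrative verification commands; they require access to the cluster and to a shell on the node or its device-plugin pod):

```shell
# Nodes labeled for CDMO-managed MIG partitioning
kubectl get nodes -l cdmo.clear.ml/gpu-partitioning=mig

# From the node (or the nvidia-device-plugin-daemonset pod), confirm MIG mode
nvidia-smi --query-gpu=index,mig.mode.current --format=csv
```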

Disabling MIGs

To disable MIG mode and restore standard full-GPU access:

  1. Ensure no running workloads are using GPUs on the target node(s).

  2. Remove the CDMO label from the target node(s) to disable the dynamic MIG reconfiguration.

    kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
    
  3. Open a shell into the nvidia-device-plugin-daemonset pod instance running on the target node(s) and run the following commands:

    # Destroy all compute instances
    nvidia-smi mig -dci

    # Destroy all GPU instances
    nvidia-smi mig -dgi

    # Disable MIG mode
    nvidia-smi -mig 0
    
  4. Edit the gpu-operator.override.yaml file to restore full-GPU access, and upgrade the gpu-operator:

    toolkit:
      env:
        - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
          value: "false"
        - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
          value: "true"
    devicePlugin:
      env:
        - name: PASS_DEVICE_SPECS
          value: "true"
        - name: FAIL_ON_INIT_ERROR
          value: "true"
        - name: DEVICE_LIST_STRATEGY # Use volume-mounts
          value: volume-mounts
        - name: DEVICE_ID_STRATEGY
          value: uuid
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
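The upgrade itself can then be applied with the same release name and namespace used at install time (assuming the values file name above):

```shell
# Re-apply the gpu-operator with the restored full-GPU configuration
helm upgrade -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml
```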