Managing GPU Fractions with ClearML Dynamic MIG Operator (CDMO)
This guide describes how to use GPU fractions in Kubernetes clusters with NVIDIA MIG (Multi-Instance GPU) and ClearML's Dynamic MIG Operator (CDMO). CDMO enables dynamic MIG configurations.
This guide covers:
- Installing CDMO
- Enabling MIG mode on your cluster
- Managing GPU partitioning dynamically
Installation
Requirements
-
Add and update the NVIDIA Helm repo:
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
-
Create a
gpu-operator.override.yaml
file with the following content:
migManager:
  enabled: false
mig:
  strategy: mixed
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY # Use volume-mounts
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
-
Install the NVIDIA
gpu-operator
using Helm with the previous configuration:
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
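Before running the install command, it can help to sanity-check the override file. The following is a minimal sketch, not part of CDMO or the GPU Operator: `check_gpu_operator_override` is a hypothetical helper that only does a plain-text check for the two MIG-related settings used in this guide. It assumes the file is written as regular multi-line YAML, with `enabled: false` directly under `migManager:` and `strategy: mixed` directly under `mig:`.

```shell
#!/bin/sh
# Hypothetical helper (not part of CDMO or the GPU Operator).
# Plain-text check of the override file; this is not a YAML parser.
check_gpu_operator_override() {
  file="$1"
  [ -f "$file" ] || { echo "missing file: $file"; return 1; }
  # The line after "migManager:" should disable the operator's own MIG manager.
  grep -A1 '^migManager:' "$file" | grep -q 'enabled: false' \
    || { echo "migManager must be disabled"; return 1; }
  # The line after "mig:" should select the mixed strategy.
  grep -A1 '^mig:' "$file" | grep -q 'strategy: mixed' \
    || { echo "mig.strategy must be mixed"; return 1; }
  echo "ok"
}
```

For example, `check_gpu_operator_override gpu-operator.override.yaml` prints `ok` when both settings are present and returns a non-zero status otherwise.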
Installing CDMO
-
Create a
cdmo-values.override.yaml
file with the following content:
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
-
Install the CDMO Helm Chart using the previous override file:
helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
-
Enable NVIDIA MIG mode on your cluster by running the following command on every node with a MIG-supported GPU (run it once for each GPU
<GPU_ID>
on the host):
nvidia-smi -i <GPU_ID> -mig 1
:::note notes
-
A node reboot may be required if the command output indicates so.
-
For convenience, this command can be run from within the
nvidia-device-plugin-daemonset
pod running on that node.
:::
-
Label all MIG-enabled GPU nodes
<NODE_NAME>
from the previous step:
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
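The last two steps can be scripted when a node has several GPUs or the cluster has several MIG-capable nodes. The sketch below is illustrative only: the function names, the `DRY_RUN` variable, and the GPU indices and node name in the usage example are assumptions, not CDMO features. With `DRY_RUN=1` (the default here) it prints the commands instead of executing them. It uses the `-i` option of `nvidia-smi` to target one GPU at a time.

```shell
#!/bin/sh
# Illustrative sketch; DRY_RUN and the function names are assumptions.
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Enable MIG mode on each GPU index passed as an argument.
# Intended to run on the GPU node (or inside the device plugin pod).
enable_mig() {
  for gpu_id in "$@"; do
    run nvidia-smi -i "$gpu_id" -mig 1
  done
}

# Apply the CDMO partitioning label to each node name passed as an argument.
label_mig_nodes() {
  for node in "$@"; do
    run kubectl label nodes "$node" cdmo.clear.ml/gpu-partitioning=mig
  done
}
```

For example, `enable_mig 0 1` prints one `nvidia-smi` command per GPU index, and `label_mig_nodes gpu-node-1` (a hypothetical node name) prints the matching `kubectl label` command. Set `DRY_RUN=0` to execute the commands after reviewing them.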
Disabling MIGs
To disable MIG mode and restore standard full-GPU access:
-
Ensure no running workflows are using GPUs on the target node(s).
-
Remove the CDMO label from the target node(s) to disable dynamic MIG reconfiguration:
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
-
Open a shell in the
nvidia-device-plugin-daemonset
pod running on each target node and run the following commands:
nvidia-smi mig -dci
nvidia-smi mig -dgi
nvidia-smi -mig 0
-
Edit the
gpu-operator.override.yaml
file to restore full-GPU access:
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY # Use volume-mounts
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
-
Upgrade the
gpu-operator
release:
helm upgrade -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml
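The order of the per-node steps matters: the label is removed first so that CDMO stops reconfiguring the node, then compute instances are destroyed before GPU instances, and only then is MIG mode switched off. The sketch below prints that sequence for one node. It is an illustration under stated assumptions: `disable_mig`, `run`, and `DRY_RUN` are hypothetical names, the pod name must be looked up by the reader, and the `gpu-operator` namespace is taken from the install command in this guide. Error handling is omitted.

```shell
#!/bin/sh
# Illustrative sketch; names and DRY_RUN are assumptions. Prints commands by default.
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: disable_mig <NODE_NAME> <DEVICE_PLUGIN_POD_ON_THAT_NODE>
disable_mig() {
  node="$1"
  pod="$2"
  # 1. Stop dynamic reconfiguration: the trailing "-" removes the label.
  run kubectl label nodes "$node" cdmo.clear.ml/gpu-partitioning-
  # 2. Destroy compute instances, then GPU instances.
  run kubectl exec -n gpu-operator "$pod" -- nvidia-smi mig -dci
  run kubectl exec -n gpu-operator "$pod" -- nvidia-smi mig -dgi
  # 3. Turn MIG mode off.
  run kubectl exec -n gpu-operator "$pod" -- nvidia-smi -mig 0
}
```

Calling `disable_mig <NODE_NAME> <POD_NAME>` with the default `DRY_RUN=1` prints four commands for review. The override file edit and `helm upgrade` are cluster-wide and are left as manual steps.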