🟢 Ready

ClearML Dynamic MIG Operator (CDMO)

Enables dynamic NVIDIA Multi-Instance GPU (MIG) configurations.

Installation

Requirements

Install the official NVIDIA gpu-operator using Helm with the following configuration.

Add and update the NVIDIA Helm repo:

helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

Create a gpu-operator.override.yaml file with the following content:

migManager:
  enabled: false
mig:
  strategy: mixed
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY # Use volume-mounts
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all

Install the official NVIDIA gpu-operator using Helm with the previous configuration:

helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
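
To confirm the operator came up, a check along these lines can help. This is a minimal sketch: the function name `gpu_operator_pods` is hypothetical, and the only assumption is the `gpu-operator` namespace used in the install command above.

```shell
# Hypothetical helper: list the Pods created by the gpu-operator release.
# Assumes the "gpu-operator" namespace from the helm install command above.
gpu_operator_pods() {
  kubectl get pods -n gpu-operator -o wide
}

# Usage:
#   gpu_operator_pods
```

The `-o wide` output includes the node each Pod runs on, which is useful later for locating the device plugin Pod of a specific node.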

Install

Create a cdmo-values.override.yaml file with the following content:

imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"

Install the CDMO operator Helm Chart using the previous override file:

helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
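
A quick way to check the result of the install is sketched below. The function name `cdmo_status` is hypothetical; the release name and namespace (`cdmo`) are taken from the command above.

```shell
# Hypothetical helper: show the Helm release status, then the Pods
# running in the "cdmo" namespace used by the install command above.
cdmo_status() {
  helm status -n cdmo cdmo && kubectl get pods -n cdmo
}

# Usage:
#   cdmo_status
```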

Enable NVIDIA MIG support on your cluster by running the following command on every Node with a MIG-capable GPU (run it once for each GPU <GPU_ID> on the host):

nvidia-smi -i <GPU_ID> -mig 1

NOTE: Reboot the node if the output of the previous command reports that a reboot is required.

NOTE: For convenience, this command can be issued from inside the nvidia-device-plugin-daemonset Pod running on the related node.
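
The two notes above can be sketched as follows. The function names are hypothetical, and the Pod name passed to `mig_mode_in_pod` is a placeholder you look up yourself (for example with `kubectl get pods -n gpu-operator -o wide`). The sketch assumes the device plugin Pod lives in the `gpu-operator` namespace and that its default container ships `nvidia-smi`.

```shell
# Hypothetical helper: print the current and pending MIG mode of every GPU.
# Run it on the node itself or inside the device plugin Pod.
mig_mode() {
  nvidia-smi --query-gpu=index,name,mig.mode.current,mig.mode.pending --format=csv
}

# Hypothetical helper: run the same query from inside a Pod.
# $1: name of the nvidia-device-plugin-daemonset Pod on the target node.
mig_mode_in_pod() {
  kubectl exec -n gpu-operator "$1" -- \
    nvidia-smi --query-gpu=index,name,mig.mode.current,mig.mode.pending --format=csv
}

# Usage:
#   mig_mode
#   mig_mode_in_pod <POD_NAME>
```

If the pending mode differs from the current mode, the change has not taken effect yet, which is the case where a reboot is typically needed.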

Label every MIG-enabled GPU node <NODE_NAME> from the previous step as follows:

kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
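
When several nodes need the label, the command can be wrapped in a loop and the result verified with a label selector. This is a sketch only; both function names are hypothetical, while the label key and value are the ones shown above.

```shell
# Hypothetical helper: apply the CDMO MIG label to each node name given.
label_mig_nodes() {
  for node in "$@"; do
    kubectl label nodes "$node" "cdmo.clear.ml/gpu-partitioning=mig"
  done
}

# Hypothetical helper: list the nodes that currently carry the label.
mig_labeled_nodes() {
  kubectl get nodes -l "cdmo.clear.ml/gpu-partitioning=mig" -o name
}

# Usage:
#   label_mig_nodes <NODE_NAME_1> <NODE_NAME_2>
#   mig_labeled_nodes
```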

Unconfigure MIGs

To disable MIG, follow these steps in order:

Ensure that no running workflows request any form of GPU on the Node(s) before reconfiguring them.
Remove the CDMO label from the target Node(s) to disable the dynamic MIG reconfiguration.
```
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
```
Open a shell in the nvidia-device-plugin-daemonset Pod instance running on the target Node(s) and execute the following commands in order:
```
# Destroy all compute instances
nvidia-smi mig -dci

# Destroy all GPU instances
nvidia-smi mig -dgi

# Disable MIG mode
nvidia-smi -mig 0
```
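
Before moving on, it may be worth confirming that MIG mode is really off. The check below is a minimal sketch with a hypothetical function name; it assumes that `nvidia-smi` reports `Enabled` in the `mig.mode.current` field for any GPU still in MIG mode.

```shell
# Hypothetical helper: succeeds (exit status 0) when no GPU reports
# "Enabled" as its current MIG mode, and fails otherwise.
mig_all_disabled() {
  ! nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader | grep -q "Enabled"
}

# Usage:
#   mig_all_disabled && echo "MIG is disabled on all GPUs"
```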

Edit the gpu-operator.override.yaml file to restore a standard configuration for full GPUs, then upgrade the gpu-operator:

toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY # Use volume-mounts
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
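
The upgrade itself is not spelled out above. A plausible form, mirroring the original install command, is sketched here; the function name is hypothetical, and the release name, namespace, chart, and override file name are assumed to be the same ones used at install time.

```shell
# Hypothetical helper: apply the edited override file to the existing
# gpu-operator release (same release name and namespace as the install).
upgrade_gpu_operator() {
  helm upgrade -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml
}

# Usage:
#   upgrade_gpu_operator
```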
