
---
title: ClearML Dynamic MIG Operator (CDMO)
---

The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations.
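Once partitioning is in place, a workload can request a specific MIG slice through standard Kubernetes resource limits. The following is a minimal illustrative pod spec, assuming the `mixed` MIG strategy configured below; the exact resource name (here `nvidia.com/mig-1g.5gb`) depends on the GPU model and the active partitioning profile:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      resources:
        limits:
          # Illustrative MIG resource name; varies with GPU model and profile
          nvidia.com/mig-1g.5gb: 1
```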

Installation

Requirements

  • Add the NVIDIA Helm repo and update it:

    helm repo add nvidia https://nvidia.github.io/gpu-operator
    helm repo update
    
  • Create a gpu-operator.override.yaml file with the following content:

    migManager:
      enabled: false
    mig:
      strategy: mixed
    toolkit:
      env:
        - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
          value: "false"
        - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
          value: "true"
    devicePlugin:
      env:
        - name: PASS_DEVICE_SPECS
          value: "true"
        - name: FAIL_ON_INIT_ERROR
          value: "true"
        - name: DEVICE_LIST_STRATEGY # Use volume-mounts
          value: volume-mounts
        - name: DEVICE_ID_STRATEGY
          value: uuid
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
    
  • Install the NVIDIA gpu-operator using Helm with the previous configuration:

    helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
    
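Before proceeding, you can verify that the operator pods came up cleanly (the namespace matches the install command above; pod names will vary by cluster):

```shell
# List the gpu-operator pods; all should reach Running or Completed status
kubectl -n gpu-operator get pods
```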

Installing CDMO

  1. Create a cdmo-values.override.yaml file with the following content:

    imageCredentials:
      password: "<CLEARML_DOCKERHUB_TOKEN>"
    
  2. Install the CDMO Helm Chart using the previous override file:

    helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
    
  3. Enable NVIDIA MIG support on your cluster by running the following command on every node with a MIG-capable GPU, once for each GPU <GPU_ID> on the host:

    nvidia-smi -i <GPU_ID> -mig 1
    

    :::note notes

    • A node reboot may be required if the command output indicates so.

    • For convenience, this command can be run from within the nvidia-device-plugin-daemonset pod running on the related node.

    :::

  4. Label each MIG-enabled GPU node <NODE_NAME> from the previous step:

    kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
    
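To confirm the setup, you can check that the label was applied and that MIG mode is reported as enabled (illustrative verification commands; they require access to the cluster and to a shell on the node or its device-plugin pod):

```shell
# Nodes labeled for CDMO-managed MIG partitioning
kubectl get nodes -l cdmo.clear.ml/gpu-partitioning=mig

# From the node (or the nvidia-device-plugin-daemonset pod), confirm MIG mode
nvidia-smi --query-gpu=index,mig.mode.current --format=csv
```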

Disabling MIGs

To disable MIG mode and restore standard full-GPU access:

  1. Ensure no running workloads are using GPUs on the target node(s).

  2. Remove the CDMO label from the target node(s) to disable the dynamic MIG reconfiguration.

    kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
    
  3. Open a shell into the nvidia-device-plugin-daemonset pod instance running on the target node(s) and run the following commands:

    # Destroy all compute instances
    nvidia-smi mig -dci

    # Destroy all GPU instances
    nvidia-smi mig -dgi

    # Disable MIG mode
    nvidia-smi -mig 0
    
  4. Edit the gpu-operator.override.yaml file to restore full-GPU access, and upgrade the gpu-operator:

    toolkit:
      env:
        - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
          value: "false"
        - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
          value: "true"
    devicePlugin:
      env:
        - name: PASS_DEVICE_SPECS
          value: "true"
        - name: FAIL_ON_INIT_ERROR
          value: "true"
        - name: DEVICE_LIST_STRATEGY # Use volume-mounts
          value: volume-mounts
        - name: DEVICE_ID_STRATEGY
          value: uuid
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
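The upgrade itself can then be applied with the same release name and namespace used at install time (assuming the values file name above):

```shell
# Re-apply the gpu-operator with the restored full-GPU configuration
helm upgrade -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml
```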