Add Fractional GPU information (#872)

2025-06-26 18:17:44 +00:00 · 2024-07-15 10:48:33 +03:00 · 2024-07-15 10:48:33 +03:00 · b31452f6a1
commit b31452f6a1
parent d608881c4f
3 changed files with 358 additions and 49 deletions
--- a/docs/clearml_agent.md
+++ b/docs/clearml_agent.md
@ -331,55 +331,6 @@ template)
 * Inside each job pod the `clearml-agent` will install the ClearML task's environment and run and monitor the experiment's 
 process

-#### Fractional GPUs
-Some jobs that you send for execution need a minimal amount of compute and memory, but you end up allocating entire GPUs 
-to them. In order to optimize your compute resource usage, you can partition GPUs into slices. 
-
-Set up MIG support for Kubernetes through your NVIDIA device plugin, and define the GPU fractions to be made available 
-to the cluster. 
-
-The ClearML Agent Helm chart lets you specify a pod template for each queue which describes the resources that the pod
-will use. The template should specify the requested GPU slices under `Containers.resources.limits` to have the queue use 
-the defined resources. For example, the following configures a K8s pod to run a 3g.20gb MIG device:
-
-```
-# tf-benchmarks-mixed.yaml
-apiVersion: v1
-kind: Pod
-metadata:
-  name: tf-benchmarks-mixed
-spec:
-  restartPolicy: Never
-  Containers:
-     - name: tf-benchmarks-mixed
-     image: ""
-      command: []
-      args: []
-      resources:
-        limits:
-          nvidia.com/mig-3g.20gb: 1
-  nodeSelector:  #optional
-    nvidia.com/gpu.product: A100-SXM4-40GB
-```
-
-When tasks are added to the relevant queue, the agent pulls the task and creates a pod to execute it, using the specified 
-GPU slice. 
-
-For example, the following configures what resources should be used to execute tasks from the `default` queue: 
-
-```
-agentk8sglue:
-  queue: default 
-  # … 
-  basePodTemplate:
-    # …
-    resources:
-      limits:
-        nvidia.com/gpu: 1
-  nodeSelector:
-    nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
-```
-
 :::important Enterprise Feature
 The ClearML Enterprise plan supports K8S servicing multiple ClearML queues, as well as providing a pod template for each 
 queue for describing the resources for each pod to use.
@ -691,6 +642,364 @@ Notice that a minimum and maximum value of GPUs is specified for the `opportunis
 will pull a Task from the `opportunistic` queue and allocate up to 4 GPUs based on availability (i.e. GPUs not currently
 being used by other agents).

+## Fractional GPUs
+
+Some tasks that you send for execution need a minimal amount of compute and memory, but you end up allocating entire 
+GPUs to them. In order to optimize your compute resource usage, you can partition GPUs into slices. You can have a GPU 
+device run multiple isolated workloads on separate slices that will not impact each other, and will only use the 
+fraction of GPU memory allocated to them. 
+
+ClearML provides several GPU slicing options to optimize compute resource utilization: 
+* [Container-based Memory Limits](#container-based-memory-limits): Use pre-packaged containers with built-in memory 
+limits to run multiple containers on the same GPU (**Available as part of the ClearML open source offering**)
+* [Kubernetes-based Static MIG Slicing](#kubernetes-static-mig-fractions): Set up Kubernetes support for NVIDIA MIG 
+(Multi-Instance GPU) to define GPU fractions for specific workloads (**Available as part of the ClearML open source offering**)
+* Dynamic GPU Slicing: On-demand GPU slicing per task for both MIG and non-MIG devices (**Available under the ClearML Enterprise plan**):
+  * [Bare Metal deployment](#bare-metal-deployment) 
+  * [Kubernetes deployment](#kubernetes-deployment)
+ 
+### Container-based Memory Limits
+Use [`clearml-fractional-gpu`](https://github.com/allegroai/clearml-fractional-gpu)'s pre-packaged containers with 
+built-in hard memory limitations. Workloads running in these containers will only be able to use up to the container's 
+memory limit. Multiple isolated workloads can run on the same GPU without impacting each other. 
+
+![Fractional GPU diagram](img/fractional_gpu_diagram.png)
+
+#### Usage 
+
+##### Manual Execution 
+
+1. Choose the container with the appropriate memory limit. ClearML supports CUDA 11.x and CUDA 12.x with memory limits 
+ranging from 2 GB to 12 GB (see [clearml-fractional-gpu repository](https://github.com/allegroai/clearml-fractional-gpu/blob/main/README.md#-containers) for full list).
+1. Launch the container:
+
+   ```bash
+   docker run -it --gpus 0 --ipc=host --pid=host clearml/fractional-gpu:u22-cu12.3-8gb bash 
+   ```
+   
+   This example runs the ClearML Ubuntu 22 with CUDA 12.3 container on GPU 0, which is limited to use up to 8GB of its memory.
+   :::note
+   --pid=host is required to allow the driver to differentiate between the container's processes and other host processes when limiting memory usage
+   :::
+1. Run the following command inside the container to verify that the fractional gpu memory limit is working correctly:
+   ```bash
+   nvidia-smi
+   ```
+   Here is the expected output for the previous, 8GB limited, example on an A100: 
+   ```bash
+   +---------------------------------------------------------------------------------------+
+   | NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
+   |-----------------------------------------+----------------------+----------------------+
+   | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
+   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
+   |                                         |                      |               MIG M. |
+   |=========================================+======================+======================|
+   |   0  A100-PCIE-40GB                Off  | 00000000:01:00.0 Off |                  N/A |
+   | 32%   33C    P0              66W / 250W |      0MiB /  8128MiB |      3%      Default |
+   |                                         |                      |             Disabled |
+   +-----------------------------------------+----------------------+----------------------+
+                                                                                            
+   +---------------------------------------------------------------------------------------+
+   | Processes:                                                                            |
+   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
+   |        ID   ID                                                             Usage      |
+   |=======================================================================================|
+   +---------------------------------------------------------------------------------------+
+   ```
+##### Remote Execution
+
+You can set a ClearML Agent to execute tasks in a fractional GPU container. Set an agent’s default container via its 
+command line. For example, all tasks pulled from the `default` queue by this agent will be executed in the Ubuntu 22 
+with CUDA 12.3 container, which is limited to use up to 8GB of its memory:
+
+```bash
+clearml-agent daemon --queue default --docker clearml/fractional-gpu:u22-cu12.3-8gb
+```
+
+The agent’s default container can be overridden via the UI: 
+1. Clone the task
+1. Set the Docker in the cloned task's **Execution** tab > **Container** section
+   
+   ![Task container](img/fractional_gpu_task_container.png)
+
+1. Enqueue the cloned task
+
+The task will be executed in the container specified in the UI.
+
+For more information, see [Docker Mode](#docker-mode).
+
+##### Fractional GPU Containers on Kubernetes
+Fractional GPU containers can be used to limit the memory consumption of your Kubernetes Job/Pod, and have multiple 
+containers share GPU devices without interfering with each other.
+
+For example, the following configures a K8s pod to run using the `clearml/fractional-gpu:u22-cu12.3-8gb` container, 
+which limits the pod to 8 GB of the GPU's memory:
+```
+apiVersion: v1
+kind: Pod
+metadata:
+  name: train-pod
+  labels:
+    app: trainme
+spec:
+  hostPID: true
+  containers:
+  - name: train-container
+    image: clearml/fractional-gpu:u22-cu12.3-8gb
+    command: ['python3', '-c', 'print(f"Free GPU Memory: (free, global) {torch.cuda.mem_get_info()}")']
+```
+
+:::note
+`hostPID: true` is required to allow the driver to differentiate between the pod's processes and other host processes 
+when limiting memory usage.
+:::
+
+#### Custom Container
+Build your own custom fractional GPU container by inheriting from one of ClearML's containers: In your Dockerfile, make 
+sure to include `From <clearml_container_image>` so the container will inherit from the relevant container.
+
+See example custom Dockerfiles in the [clearml-fractional-gpu repository](https://github.com/allegroai/clearml-fractional-gpu/tree/main/examples).
+
+### Kubernetes Static MIG Fractions
+Set up NVIDIA MIG (Multi-Instance GPU) support for Kubernetes to define GPU fraction profiles for specific workloads 
+through your NVIDIA device plugin.
+
+The ClearML Agent Helm chart lets you specify a pod template for each queue which describes the resources that the pod 
+will use. The template should specify the requested GPU slices under `Containers.resources.limits` to have the pods use 
+the defined resources. For example, the following configures a K8s pod to run a `3g.20gb` MIG device:
+```
+# tf-benchmarks-mixed.yaml
+apiVersion: v1
+kind: Pod
+metadata:
+ name: tf-benchmarks-mixed
+spec:
+ restartPolicy: Never
+ Containers:
+    - name: tf-benchmarks-mixed
+    image: ""
+     command: []
+     args: []
+     resources:
+       limits:
+         nvidia.com/mig-3g.20gb: 1
+ nodeSelector:  #optional
+   nvidia.com/gpu.product: A100-SXM4-40GB
+```
+
+When tasks are added to the relevant queue, the agent pulls the task and creates a pod to execute it, using the 
+specified GPU slice.
+
+For example, the following configures tasks from the default queue to use `1g.5gb` MIG slices:
+```
+agentk8sglue:
+ queue: default
+ # …
+ basePodTemplate:
+   # …
+   resources:
+     limits:
+       nvidia.com/gpu: 1
+ nodeSelector:
+   nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
+```
+
+### Dynamic GPU Fractions
+
+:::important Enterprise Feature
+Dynamic GPU slicing is available under the ClearML Enterprise plan. 
+:::
+
+ClearML dynamic GPU fractions provide on-the-fly, per task GPU slicing, without having to set up containers or 
+pre-configure tasks with memory limits. Specify a GPU fraction for a queue in the agent invocation, and every task the 
+agent pulls from the queue will run on a container with the specified limit. This way you can safely run multiple tasks 
+simultaneously without worrying that one task will use all of the GPU's memory. 
+
+You can dynamically slice GPUs on [bare metal](#bare-metal-deployment) or on [Kubernetes](#kubernetes-deployment), for 
+both MIG-enabled and non-MIG devices.
+
+#### Bare Metal Deployment
+1. Install the required packages:
+
+   ```bash
+   pip install clearml-agent clearml-agent-fractional-gpu
+   ```
+1. Start the ClearML agent with dynamic GPU allocation. Use `--gpus` to specify the active GPUs, and use the `--queue` 
+   flag to specify the queue name(s) and number (or fraction) of GPUs to allocate to them. 
+
+   ```
+   clearml-agent daemon --dynamic-gpus --gpus 0, 1 --queue half_gpu=0.5
+   ```
+
+The agent can utilize 2 GPUs (GPUs 0 and 1). Every task enqueued to the `half_gpu` queue will be run by the agent and 
+only allocated 50% GPU memory (i.e. 4 tasks can run concurrently). 
+
+:::note
+You can allocate GPUs for a queue’s tasks by specifying either a fraction of a single GPU in increments as small as 0.125 
+(e.g. 0.125, 0.25, 0.50, etc.) or whole GPUs (e.g. 1, 2, 4, etc.). However, you cannot specify fractions greater than 
+one GPU (e.g. 1.25).
+::: 
+
+You can set up multiple queues, each allocated a different number of GPUs per task. Note that the order that the queues 
+are listed is their order of priority, so the agent will service tasks from the first listed queue before servicing 
+subsequent queues:
+```
+clearml-agent daemon --dynamic-gpus --gpus 0-2 --queue dual_gpus=2 quarter_gpu=0.25 half_gpu=0.5 single_gpu=1 
+```
+
+This agent will utilize 3 GPUs (GPUs 0, 1, and 2). The agent can spin multiple jobs from the different queues based on 
+the number of GPUs configured to the queue. 
+
+##### Example Workflow
+Let’s say that four tasks are enqueued, one task for each of the above queues (`dual_gpus`, `quarter_gpu`, `half_gpu`, 
+`single_gpu`). The agent will first pull the task from the `dual_gpus` queue since it is listed first, and will run it 
+using 2 GPUs. It will next run the tasks from `quarter_gpu` and `half_gpu`--both will run on the remaining available 
+GPU. This leaves the task in the `single_gpu` queue. Currently 2.75 GPUs out of the 3 are in use so the task will only 
+be pulled and run when enough GPUs become available. 
+
+#### Kubernetes Deployment
+
+ClearML supports fractional GPUs on Kubernetes through custom Enterprise Helm Charts for both MIG and non-MIG devices: 
+* `clearml-dynamic-mig-operator` for [MIG devices](#mig-enabled-gpus)
+* `clearml-fractional-gpu-injector` for [non-MIG devices](#non-mig-devices) 
+
+For either setup, you can set up in your Enterprise ClearML Agent Helm chart the resources requirements of tasks sent to 
+each queue. When a task is enqueued in ClearML, it translates into a Kubernetes pod running on the designated device 
+with the specified fractional resource as defined in the Agent Helm chart. 
+
+##### MIG-enabled GPUs 
+The **ClearML Dynamic MIG Operator** (CDMO) chart enables running AI workloads on K8s with optimized hardware utilization 
+and workload performance by facilitating MIG GPU partitioning. Make sure you have a [MIG capable GPU](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus).
+
+###### Prepare Cluster 
+* Install the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator):
+
+   ```
+   helm repo add nvidia https://helm.ngc.nvidia.com
+   helm repo update
+   
+   helm install -n gpu-operator \
+        gpu-operator \
+        nvidia/gpu-operator \
+        --create-namespace \
+        --set migManager.enabled=false \
+        --set mig.strategy=mixed
+   ```
+* Enable MIG support:
+    1. Enable dynamic MIG support on your cluster by running following command on all nodes used for training (run for each GPU ID in your cluster):
+  
+       ```
+       nvidia-smi -i <gpu_id> -mig 1
+       ```
+    1. Reboot node if required.
+    1. Add following label to all nodes that will be used for training:
+  
+       ```
+       kubectl label nodes <node-name> "cdmo.clear.ml/gpu-partitioning=mig"
+       ```  
+
+###### Configure ClearML Queues
+The ClearML Enterprise plan supports K8S servicing multiple ClearML queues, as well as providing a pod template for each 
+queue for describing the resources for each pod to use.
+
+In the `values.yaml` file, set the resource requirements of each ClearML queue. For example, the following configures 
+what resources to use for the `default025` and the `default050` queues: 
+```
+agentk8sglue:
+  queues:
+default025:
+      templateOverrides:
+        labels:
+          required-resources: "0.25"
+        resources:
+          limits:
+            nvidia.com/mig-1g.10gb: 1
+default050:
+      templateOverrides:
+        labels:
+          required-resources: "0.50"
+        resources:
+          limits:
+            nvidia.com/mig-1g.10gb: 1
+```
+
+##### Non-MIG Devices
+The **Fractional GPU Injector** chart enables running AI workloads on k8s in an optimized way, allowing you to use 
+fractional GPUs on non-MIG devices.
+
+###### Requirements
+Install the [Nvidia GPU Operator](https://github.com/NVIDIA/gpu-operator) through the Helm chart. Make sure `timeSlicing` 
+is enabled.
+
+For example:
+```
+devicePlugin:
+  config:
+    name: device-plugin-config
+    create: true
+    default: "any"
+    data:
+      any: |-
+        version: v1
+        flags:
+          migStrategy: none
+        sharing:
+          timeSlicing:
+            renameByDefault: false
+            failRequestsGreaterThanOne: false
+            resources:
+              - name: nvidia.com/gpu
+                replicas: 4
+```
+
+The number of replicas is the maximum number of slices on a GPU.
+
+###### Configure ClearML Queues
+In the `values.yaml` file, set the resource requirements of each ClearML queue. When a task is enqueued to the queue, 
+it translates into a Kubernetes pod running on the designated device with the specified resource slice. The queues must 
+be configured with specific labels and annotations. For example, the following configures the `default0500` queue to use 
+50% of a GPU and the `default0250` queue to use 25% of a GPU:
+```
+agentk8sglue:
+  queues:
+    default0500:
+      templateOverrides:
+        labels:
+          required-resources: "0.5"
+          clearml-injector/fraction: "0.500"
+        resources:
+          limits:
+            nvidia.com/gpu: 1
+            clear.ml/fraction-1: "0.5"
+      queueSettings:
+        maxPods: 10
+    default0250:
+      templateOverrides:
+        labels:
+          required-resources: "0.25"
+          clearml-injector/fraction: "0.250"
+        resources:
+          limits:
+            nvidia.com/gpu: 1
+            clear.ml/fraction-1: "0.25"
+      queueSettings:
+        maxPods: 10
+```
+If a pod has a label matching the pattern `clearml-injector/fraction: "<gpu_fraction_value>"`, the injector will 
+configure that pod to utilize the specified fraction of the GPU:
+```
+labels:
+  clearml-injector/fraction: "<gpu_fraction_value>"
+```
+Where `<gpu_fraction_value>` must be set to one of the following values:
+* "0.125"
+* "0.250"
+* "0.375"
+* "0.500"
+* "0.625"
+* "0.750"
+* "0.875"
+
 ## Services Mode
 ClearML Agent supports a **Services Mode** where, as soon as a task is launched off of its queue, the agent moves on to the 
 next task without waiting for the previous one to complete. This mode is intended for running resource-sparse tasks that 
--- a/docs/img/fractional_gpu_diagram.png
+++ b/docs/img/fractional_gpu_diagram.png
--- a/docs/img/fractional_gpu_task_container.png
+++ b/docs/img/fractional_gpu_task_container.png