Add fractional GPU support info (#772)

pollfly 2024-02-11 16:59:59 +02:00 committed by GitHub
parent b1d80a2719
commit f2006f5e9b


@@ -318,14 +318,93 @@ SSH_AUTH_SOCK=<file_socket> clearml-agent daemon --gpus <your config> --queue <y
### Kubernetes
Agents can be deployed on bare metal or as Docker containers in a Kubernetes cluster. ClearML Agent adds the missing scheduling
capabilities to Kubernetes, allows for more flexible automation from code, and gives access to all of ClearML Agent's
features.
ClearML Agent is deployed onto a Kubernetes cluster through its Kubernetes Glue, which maps ClearML jobs directly to K8s
jobs:

* Use the [ClearML Agent Helm Chart](https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml-agent) to
  spin up an agent pod acting as a controller (see the values sketch below). Alternatively (less recommended), run a
  [k8s glue script](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on a K8s CPU node.
* The ClearML K8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job based on the provided YAML
  template.
* Inside each job pod, `clearml-agent` installs the ClearML task's environment, then runs and monitors the experiment's
  process.
For more details, see [Kubernetes integration](https://github.com/allegroai/clearml-agent#kubernetes-integration-optional).
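
The snippet below is a minimal sketch of such a controller deployment, not a complete setup. The key names assume the
[ClearML Agent Helm Chart](https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml-agent)'s
`values.yaml` (verify them against the chart), and the credentials and server URLs are placeholders you must replace:

```yaml
# values.override.yaml -- a hedged sketch; key names assume the
# clearml-agent Helm chart and should be verified against its values.yaml.
clearml:
  agentk8sglueKey: "<ACCESS_KEY>"       # placeholder ClearML credentials
  agentk8sglueSecret: "<SECRET_KEY>"
agentk8sglue:
  apiServerUrlReference: "https://api.clear.ml"     # your ClearML API server
  fileServerUrlReference: "https://files.clear.ml"
  webServerUrlReference: "https://app.clear.ml"
  queue: default                        # the ClearML queue this agent serves
```

You would then install the chart with this override, for example
`helm install clearml-agent allegroai/clearml-agent -f values.override.yaml`.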
#### Fractional GPUs
Some jobs that you send for execution need only a minimal amount of compute and memory, yet an entire GPU ends up
allocated to them. To optimize your compute resource usage, you can partition GPUs into slices.
Set up NVIDIA Multi-Instance GPU (MIG) support for Kubernetes through your NVIDIA device plugin, and define the GPU
fractions to be made available to the cluster (see the sketch below).
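
As a hedged sketch of that setup, the NVIDIA device plugin's Helm chart exposes a `migStrategy` value (verify the exact
key against the plugin's documentation); setting it to `mixed` advertises each MIG profile as its own schedulable
resource:

```yaml
# NVIDIA device plugin Helm values -- a sketch; the migStrategy key is
# assumed from the nvidia-device-plugin chart. With "mixed", each MIG
# profile becomes a distinct resource, e.g. nvidia.com/mig-3g.20gb and
# nvidia.com/mig-1g.5gb, alongside nvidia.com/gpu for whole GPUs.
migStrategy: mixed
```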
The ClearML Agent Helm chart lets you specify a pod template for each queue, describing the resources that the pod
will use. The template should specify the requested GPU slices under `containers.resources.limits` so that the queue uses
the defined resources. For example, the following configures a K8s pod to use a 3g.20gb MIG device:
```yaml
# tf-benchmarks-mixed.yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-benchmarks-mixed
spec:
  restartPolicy: Never
  containers:
    - name: tf-benchmarks-mixed
      image: ""
      command: []
      args: []
      resources:
        limits:
          nvidia.com/mig-3g.20gb: 1  # request one 3g.20gb MIG slice
  nodeSelector: # optional
    nvidia.com/gpu.product: A100-SXM4-40GB
```
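
For orientation, a node partitioned this way advertises each MIG slice as a separate allocatable resource, which is what
the `limits` above match against. The excerpt below is illustrative only (profiles and counts will differ per cluster),
shown as it might appear in `kubectl get node <node> -o yaml`:

```yaml
# Illustrative node status excerpt; actual profiles and counts vary.
status:
  allocatable:
    nvidia.com/mig-3g.20gb: "2"
    nvidia.com/mig-1g.5gb: "4"
```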
When tasks are added to the relevant queue, the agent pulls the task and creates a pod to execute it, using the specified
GPU slice.
For example, the following configures the resources to be used for executing tasks from the `default` queue:
```yaml
agentk8sglue:
  queue: default
  # …
  basePodTemplate:
    # …
    resources:
      limits:
        nvidia.com/gpu: 1
    nodeSelector:
      nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
```
:::important Enterprise Feature
The ClearML Enterprise plan supports K8s serving multiple ClearML queues, as well as providing a pod template for each
queue that describes the resources each pod will use.
For example, the following configures which resources to use for `example_queue_1` and `example_queue_2`:
```yaml
agentk8sglue:
  queues:
    example_queue_1:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
        nodeSelector:
          nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
    example_queue_2:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2
        nodeSelector:
          nvidia.com/gpu.product: A100-SXM4-40GB
```
:::
### Slurm