### Kubernetes

Agents can be deployed bare-metal or as Docker containers in a Kubernetes cluster. ClearML Agent adds the missing
scheduling capabilities to Kubernetes, allows for more flexible automation from code, and gives access to all of
ClearML Agent's features.

ClearML Agent is deployed onto a Kubernetes cluster through its Kubernetes-Glue, which maps ClearML jobs directly to K8s
jobs:
* Use the [ClearML Agent Helm Chart](https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml-agent) to
  spin up an agent pod acting as a controller (a minimal values sketch follows this list). Alternatively (less
  recommended), run a [k8s glue script](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py)
  on a K8s CPU node.
* The ClearML K8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on a provided YAML
  template).
* Inside each job pod, `clearml-agent` installs the ClearML task's environment, then runs and monitors the experiment's
  process.
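
As a minimal sketch of the Helm route, the following `values.yaml` shows the general shape of such a deployment. The
`agentk8sglue.queue` key is used later in this section; the credential and server-URL key names shown here are
assumptions about the chart's values layout, so verify them against the chart's default values before use:

```yaml
# values.yaml (minimal sketch, not a complete configuration)
# The credential and URL key names below are assumptions; check the
# clearml-agent chart's default values for the exact names.
clearml:
  agentk8sglueKey: "<CLEARML_ACCESS_KEY>"       # assumed key name
  agentk8sglueSecret: "<CLEARML_SECRET_KEY>"    # assumed key name
agentk8sglue:
  queue: default                 # the ClearML queue this agent serves
  apiServerUrlReference: "https://api.clear.ml"     # assumed URL key names
  fileServerUrlReference: "https://files.clear.ml"
  webServerUrlReference: "https://app.clear.ml"
```

With the chart repository added (for example as `allegroai`), a command along the lines of
`helm install clearml-agent allegroai/clearml-agent -f values.yaml` installs the agent with these values.
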
#### Fractional GPUs

Some jobs that you send for execution need only a minimal amount of compute and memory, but you end up allocating
entire GPUs to them. To optimize your compute resource usage, you can partition GPUs into slices.

Set up MIG support for Kubernetes through your NVIDIA device plugin, and define the GPU fractions to be made available
to the cluster.
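
For orientation, the sketch below shows one common way to enable this, assuming the device plugin is deployed via the
NVIDIA GPU Operator; the exact keys, strategy values, and partitioning profile names depend on your operator version
and hardware:

```yaml
# Sketch: MIG settings in the NVIDIA GPU Operator's Helm values
# (an assumed deployment route; adapt to however you run the
# NVIDIA device plugin).
mig:
  strategy: mixed   # expose slices as nvidia.com/mig-<profile> resources;
                    # "single" exposes them as nvidia.com/gpu instead

# A partitioning profile is then typically requested per node by
# labeling it, e.g.: nvidia.com/mig.config=all-3g.20gb
```

With the `mixed` strategy, slices appear as distinct named resources (as in the pod example below); with `single`, they
are exposed as `nvidia.com/gpu` on pre-sliced nodes (as in the queue examples further down).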

The ClearML Agent Helm chart lets you specify a pod template for each queue, describing the resources that the pod will
use. The template should specify the requested GPU slices under `containers.resources.limits` to have the queue use the
defined resources. For example, the following configures a K8s pod to run a 3g.20gb MIG device:
```yaml
# tf-benchmarks-mixed.yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-benchmarks-mixed
spec:
  restartPolicy: Never
  containers:
    - name: tf-benchmarks-mixed
      image: ""
      command: []
      args: []
      resources:
        limits:
          nvidia.com/mig-3g.20gb: 1
  nodeSelector: # optional
    nvidia.com/gpu.product: A100-SXM4-40GB
```

When tasks are added to the relevant queue, the agent pulls the task and creates a pod to execute it, using the
specified GPU slice.

For example, the following configures what resources should be used to execute tasks from the `default` queue:
```yaml
agentk8sglue:
  queue: default
  # …
  basePodTemplate:
    # …
    resources:
      limits:
        nvidia.com/gpu: 1
    nodeSelector:
      nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
```
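
If your device plugin exposes slices as named MIG resources instead (the `mixed` strategy used in the pod example
above), the same template can request a slice directly rather than selecting a pre-sliced node. The following variant
is a sketch based on that naming scheme:

```yaml
# Variant sketch: request a MIG slice as a named resource
# (assumes the device plugin's "mixed" MIG strategy).
agentk8sglue:
  queue: default
  basePodTemplate:
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one 1g.5gb slice per task pod
```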

:::important Enterprise Feature
The ClearML Enterprise plan supports K8S servicing multiple ClearML queues, as well as providing a pod template for each
queue that describes the resources its pods will use.

For example, the following configures which resources to use for `example_queue_1` and `example_queue_2`:
```yaml
agentk8sglue:
  queues:
    example_queue_1:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
        nodeSelector:
          nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
    example_queue_2:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2
        nodeSelector:
          nvidia.com/gpu.product: A100-SXM4-40GB
```
:::

### Slurm