Add fractional GPU support info (#772)

pollfly 2024-02-11 16:59:59 +02:00 committed by GitHub
parent b1d80a2719
commit f2006f5e9b


@@ -318,14 +318,93 @@ SSH_AUTH_SOCK=<file_socket> clearml-agent daemon --gpus <your config> --queue <y
### Kubernetes
Agents can be deployed on bare metal or as Docker containers in a Kubernetes cluster. ClearML Agent adds the missing scheduling
capabilities to Kubernetes, allows for more flexible automation from code, and gives access to all of ClearML Agent's
features.
ClearML Agent is deployed onto a Kubernetes cluster through its Kubernetes Glue, which maps ClearML jobs directly to K8s
jobs:

* Use the [ClearML Agent Helm Chart](https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml-agent) to
  spin up an agent pod acting as a controller (see the values sketch below). Alternatively (less recommended), run a
  [k8s glue script](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on a K8s CPU node.
* The ClearML K8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job based on the provided YAML
  template.
* Inside each job pod, `clearml-agent` installs the ClearML task's environment, then runs and monitors the experiment's
  process.
For more details, see [Kubernetes integration](https://github.com/allegroai/clearml-agent#kubernetes-integration-optional).
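
The snippet below is a minimal sketch of such a controller deployment, not a complete setup. The key names assume the
[ClearML Agent Helm Chart](https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml-agent)'s
`values.yaml` (verify them against the chart), and the credentials and server URLs are placeholders you must replace:

```yaml
# values.override.yaml -- a hedged sketch; key names assume the
# clearml-agent Helm chart and should be verified against its values.yaml.
clearml:
  agentk8sglueKey: "<ACCESS_KEY>"       # placeholder ClearML credentials
  agentk8sglueSecret: "<SECRET_KEY>"
agentk8sglue:
  apiServerUrlReference: "https://api.clear.ml"     # your ClearML API server
  fileServerUrlReference: "https://files.clear.ml"
  webServerUrlReference: "https://app.clear.ml"
  queue: default                        # the ClearML queue this agent serves
```

You would then install the chart with this override, for example
`helm install clearml-agent allegroai/clearml-agent -f values.override.yaml`.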
#### Fractional GPUs
Some jobs that you send for execution need only a minimal amount of compute and memory, yet an entire GPU ends up
allocated to them. To optimize your compute resource usage, you can partition GPUs into slices.
Set up NVIDIA Multi-Instance GPU (MIG) support for Kubernetes through your NVIDIA device plugin, and define the GPU
fractions to be made available to the cluster (see the sketch below).
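
As a hedged sketch of that setup, the NVIDIA device plugin's Helm chart exposes a `migStrategy` value (verify the exact
key against the plugin's documentation); setting it to `mixed` advertises each MIG profile as its own schedulable
resource:

```yaml
# NVIDIA device plugin Helm values -- a sketch; the migStrategy key is
# assumed from the nvidia-device-plugin chart. With "mixed", each MIG
# profile becomes a distinct resource, e.g. nvidia.com/mig-3g.20gb and
# nvidia.com/mig-1g.5gb, alongside nvidia.com/gpu for whole GPUs.
migStrategy: mixed
```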
The ClearML Agent Helm chart lets you specify a pod template for each queue, describing the resources that the pod
will use. The template should specify the requested GPU slices under `containers.resources.limits` so that the queue uses
the defined resources. For example, the following configures a K8s pod to use a 3g.20gb MIG device:
```yaml
# tf-benchmarks-mixed.yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-benchmarks-mixed
spec:
  restartPolicy: Never
  containers:
    - name: tf-benchmarks-mixed
      image: ""
      command: []
      args: []
      resources:
        limits:
          nvidia.com/mig-3g.20gb: 1  # request one 3g.20gb MIG slice
  nodeSelector: # optional
    nvidia.com/gpu.product: A100-SXM4-40GB
```
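
For orientation, a node partitioned this way advertises each MIG slice as a separate allocatable resource, which is what
the `limits` above match against. The excerpt below is illustrative only (profiles and counts will differ per cluster),
shown as it might appear in `kubectl get node <node> -o yaml`:

```yaml
# Illustrative node status excerpt; actual profiles and counts vary.
status:
  allocatable:
    nvidia.com/mig-3g.20gb: "2"
    nvidia.com/mig-1g.5gb: "4"
```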
When tasks are added to the relevant queue, the agent pulls the task and creates a pod to execute it, using the specified
GPU slice.
For example, the following configures the resources to be used for executing tasks from the `default` queue:
```yaml
agentk8sglue:
  queue: default
  # …
  basePodTemplate:
    # …
    resources:
      limits:
        nvidia.com/gpu: 1
    nodeSelector:
      nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
```
:::important Enterprise Feature
The ClearML Enterprise plan supports K8s serving multiple ClearML queues, as well as providing a pod template for each
queue that describes the resources each pod will use.
For example, the following configures which resources to use for `example_queue_1` and `example_queue_2`:
```yaml
agentk8sglue:
  queues:
    example_queue_1:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
        nodeSelector:
          nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
    example_queue_2:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2
        nodeSelector:
          nvidia.com/gpu.product: A100-SXM4-40GB
```
:::
### Slurm