### Kubernetes

Agents can be deployed bare-metal or as dockers in a Kubernetes cluster. ClearML Agent adds the missing scheduling
capabilities to Kubernetes, allows for more flexible automation from code, and gives access to all of ClearML Agent's
features.

ClearML Agent is deployed onto a Kubernetes cluster through its Kubernetes Glue, which maps ClearML jobs directly to K8s
jobs:
* Use the [ClearML Agent Helm Chart](https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml-agent) to
  spin up an agent pod acting as a controller (see the values sketch below). Alternatively (less recommended), run a
  [k8s glue script](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on a K8s CPU node.
* The ClearML K8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on a provided YAML
  template).
* Inside each job pod, `clearml-agent` installs the ClearML task's environment, then runs and monitors the experiment's
  process.

For more details, see [Kubernetes integration](https://github.com/allegroai/clearml-agent#kubernetes-integration-optional).
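
The following is a minimal sketch of a Helm values override for the controller-pod route above. Only `agentk8sglue.queue`
is shown, following the convention used in the examples below; the install command in the comment assumes the chart is
served from the `allegroai` Helm repository:

```yaml
# values.override.yaml (sketch)
# assumed install: helm install clearml-agent allegroai/clearml-agent -f values.override.yaml
agentk8sglue:
  # the ClearML queue this agent services
  queue: default
```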
#### Fractional GPUs

Some jobs you send for execution need only a minimal amount of compute and memory, yet end up being allocated entire
GPUs. To optimize your compute resource usage, you can partition GPUs into slices.

Set up MIG support for Kubernetes through your NVIDIA device plugin, and define the GPU fractions to be made available
to the cluster.
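
For instance, with the NVIDIA device plugin deployed via Helm, a `mixed` MIG strategy exposes each MIG profile as its own
extended resource. The snippet below is only a sketch: the value name (`migStrategy`) and where it lives vary across
device-plugin and GPU Operator chart versions, so check your chart's documentation:

```yaml
# NVIDIA device plugin values override (sketch)
# "mixed" exposes each MIG profile as its own resource, e.g. nvidia.com/mig-3g.20gb;
# "single" exposes all slices as uniform nvidia.com/gpu resources
migStrategy: mixed
```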

The ClearML Agent Helm chart lets you specify a pod template for each queue, describing the resources that the pod will
use. The template should specify the requested GPU slices under `containers.resources.limits` so that the queue uses the
defined resources. For example, the following configures a K8s pod to run a 3g.20gb MIG device:

```yaml
# tf-benchmarks-mixed.yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-benchmarks-mixed
spec:
  restartPolicy: Never
  containers:   # note: lowercase; the K8s Pod spec field is `containers`
    - name: tf-benchmarks-mixed
      image: ""
      command: []
      args: []
      resources:
        limits:
          nvidia.com/mig-3g.20gb: 1   # request one 3g.20gb MIG slice
  nodeSelector: # optional
    nvidia.com/gpu.product: A100-SXM4-40GB
```
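
To confirm that the slices are actually exposed to the cluster, inspect a node's allocatable resources (e.g. via
`kubectl get node <node-name> -o yaml`). The profile names and counts below are purely illustrative; they depend on how
the GPUs were partitioned:

```yaml
# illustrative excerpt of a Node object after MIG partitioning (not real output)
status:
  allocatable:
    nvidia.com/mig-3g.20gb: "2"
    nvidia.com/mig-1g.5gb: "4"
```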

When tasks are added to the relevant queue, the agent pulls the task and creates a pod to execute it, using the specified
GPU slice.

For example, the following configures the resources to use when executing tasks from the `default` queue. Note that here
the slice is requested as a plain `nvidia.com/gpu` resource and the node is selected by its MIG product label, which
matches NVIDIA's `single` MIG strategy:

```yaml
agentk8sglue:
  queue: default
  # …
  basePodTemplate:
    # …
    resources:
      limits:
        nvidia.com/gpu: 1
    nodeSelector:
      nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
```
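
Under a `mixed` MIG strategy, the same `basePodTemplate` could instead request the slice by its full profile name, as in
the earlier pod example. A sketch combining the two examples above:

```yaml
agentk8sglue:
  queue: default
  basePodTemplate:
    resources:
      limits:
        # mixed MIG strategy: request the slice by profile name
        nvidia.com/mig-3g.20gb: 1
    nodeSelector: # optional
      nvidia.com/gpu.product: A100-SXM4-40GB
```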

:::important Enterprise Feature
The ClearML Enterprise plan supports K8s servicing multiple ClearML queues, as well as providing a pod template for each
queue that describes the resources each of its pods will use.

For example, the following configures which resources to use for `example_queue_1` and `example_queue_2`:

```yaml
agentk8sglue:
  queues:
    example_queue_1:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
        nodeSelector:
          nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
    example_queue_2:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2
        nodeSelector:
          nvidia.com/gpu.product: A100-SXM4-40GB
```
:::

### Slurm