### Kubernetes

Agents can be deployed bare-metal or as Docker containers in a Kubernetes cluster. ClearML Agent adds the missing
scheduling capabilities to Kubernetes, allows for more flexible automation from code, and gives access to all of
ClearML Agent's features.

ClearML Agent is deployed onto a Kubernetes cluster through its Kubernetes-Glue, which maps ClearML jobs directly to K8s
jobs:
* Use the [ClearML Agent Helm Chart](https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml-agent) to
  spin up an agent pod acting as a controller (a minimal values sketch follows this list). Alternatively (less
  recommended), run a [k8s glue script](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py)
  on a K8s CPU node.
* The ClearML K8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on a provided YAML
  template).
* Inside each job pod, `clearml-agent` installs the ClearML task's environment, then runs and monitors the experiment's
  process.
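
As a minimal sketch of the Helm route, the following `values.yaml` shows the general shape of such a deployment. The
`agentk8sglue.queue` key is used later in this section; the credential and server-URL key names shown here are
assumptions about the chart's values layout, so verify them against the chart's default values before use:

```yaml
# values.yaml (minimal sketch, not a complete configuration)
# The credential and URL key names below are assumptions; check the
# clearml-agent chart's default values for the exact names.
clearml:
  agentk8sglueKey: "<CLEARML_ACCESS_KEY>"       # assumed key name
  agentk8sglueSecret: "<CLEARML_SECRET_KEY>"    # assumed key name
agentk8sglue:
  queue: default                 # the ClearML queue this agent serves
  apiServerUrlReference: "https://api.clear.ml"     # assumed URL key names
  fileServerUrlReference: "https://files.clear.ml"
  webServerUrlReference: "https://app.clear.ml"
```

With the chart repository added (for example as `allegroai`), a command along the lines of
`helm install clearml-agent allegroai/clearml-agent -f values.yaml` installs the agent with these values.
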
#### Fractional GPUs

Some jobs that you send for execution need only a minimal amount of compute and memory, but you end up allocating
entire GPUs to them. To optimize your compute resource usage, you can partition GPUs into slices.

Set up MIG support for Kubernetes through your NVIDIA device plugin, and define the GPU fractions to be made available
to the cluster.
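
For orientation, the sketch below shows one common way to enable this, assuming the device plugin is deployed via the
NVIDIA GPU Operator; the exact keys, strategy values, and partitioning profile names depend on your operator version
and hardware:

```yaml
# Sketch: MIG settings in the NVIDIA GPU Operator's Helm values
# (an assumed deployment route; adapt to however you run the
# NVIDIA device plugin).
mig:
  strategy: mixed   # expose slices as nvidia.com/mig-<profile> resources;
                    # "single" exposes them as nvidia.com/gpu instead

# A partitioning profile is then typically requested per node by
# labeling it, e.g.: nvidia.com/mig.config=all-3g.20gb
```

With the `mixed` strategy, slices appear as distinct named resources (as in the pod example below); with `single`, they
are exposed as `nvidia.com/gpu` on pre-sliced nodes (as in the queue examples further down).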

The ClearML Agent Helm chart lets you specify a pod template for each queue, describing the resources that the pod will
use. The template should specify the requested GPU slices under `containers.resources.limits` to have the queue use the
defined resources. For example, the following configures a K8s pod to run a 3g.20gb MIG device:
```yaml
# tf-benchmarks-mixed.yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-benchmarks-mixed
spec:
  restartPolicy: Never
  containers:
    - name: tf-benchmarks-mixed
      image: ""
      command: []
      args: []
      resources:
        limits:
          nvidia.com/mig-3g.20gb: 1
  nodeSelector: # optional
    nvidia.com/gpu.product: A100-SXM4-40GB
```

When tasks are added to the relevant queue, the agent pulls the task and creates a pod to execute it, using the
specified GPU slice.

For example, the following configures what resources should be used to execute tasks from the `default` queue:
```yaml
agentk8sglue:
  queue: default
  # …
  basePodTemplate:
    # …
    resources:
      limits:
        nvidia.com/gpu: 1
    nodeSelector:
      nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
```
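
If your device plugin exposes slices as named MIG resources instead (the `mixed` strategy used in the pod example
above), the same template can request a slice directly rather than selecting a pre-sliced node. The following variant
is a sketch based on that naming scheme:

```yaml
# Variant sketch: request a MIG slice as a named resource
# (assumes the device plugin's "mixed" MIG strategy).
agentk8sglue:
  queue: default
  basePodTemplate:
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one 1g.5gb slice per task pod
```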

:::important Enterprise Feature
The ClearML Enterprise plan supports K8S servicing multiple ClearML queues, as well as providing a pod template for each
queue that describes the resources its pods will use.

For example, the following configures which resources to use for `example_queue_1` and `example_queue_2`:
```yaml
agentk8sglue:
  queues:
    example_queue_1:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
        nodeSelector:
          nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
    example_queue_2:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2
        nodeSelector:
          nvidia.com/gpu.product: A100-SXM4-40GB
```
:::

### Slurm