diff --git a/docs/clearml_agent.md b/docs/clearml_agent.md
index e1683596..85d29a47 100644
--- a/docs/clearml_agent.md
+++ b/docs/clearml_agent.md
@@ -331,55 +331,6 @@ template)
 * Inside each job pod the `clearml-agent` will install the ClearML task's environment and run and monitor the
 experiment's process
 
-#### Fractional GPUs
-Some jobs that you send for execution need a minimal amount of compute and memory, but you end up allocating entire GPUs
-to them. In order to optimize your compute resource usage, you can partition GPUs into slices.
-
-Set up MIG support for Kubernetes through your NVIDIA device plugin, and define the GPU fractions to be made available
-to the cluster.
-
-The ClearML Agent Helm chart lets you specify a pod template for each queue which describes the resources that the pod
-will use. The template should specify the requested GPU slices under `Containers.resources.limits` to have the queue use
-the defined resources. For example, the following configures a K8s pod to run a 3g.20gb MIG device:
-
-```
-# tf-benchmarks-mixed.yaml
-apiVersion: v1
-kind: Pod
-metadata:
-  name: tf-benchmarks-mixed
-spec:
-  restartPolicy: Never
-  Containers:
-  - name: tf-benchmarks-mixed
-    image: ""
-    command: []
-    args: []
-    resources:
-      limits:
-        nvidia.com/mig-3g.20gb: 1
-  nodeSelector: #optional
-    nvidia.com/gpu.product: A100-SXM4-40GB
-```
-
-When tasks are added to the relevant queue, the agent pulls the task and creates a pod to execute it, using the specified
-GPU slice.
-
-For example, the following configures what resources should be used to execute tasks from the `default` queue:
-
-```
-agentk8sglue:
-  queue: default
-  # …
-  basePodTemplate:
-    # …
-    resources:
-      limits:
-        nvidia.com/gpu: 1
-    nodeSelector:
-      nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
-```
-
 :::important Enterprise Feature
 The ClearML Enterprise plan supports K8S servicing multiple ClearML queues, as well as providing a pod template for each
 queue for describing the resources for each pod to use.
@@ -691,6 +642,364 @@ Notice that a minimum and maximum value of GPUs is specified for the `opportunis
 will pull a Task from the `opportunistic` queue and allocate up to 4 GPUs based on availability (i.e. GPUs not currently
 being used by other agents).
 
+## Fractional GPUs
+
+Some tasks that you send for execution need only a minimal amount of compute and memory, but you end up allocating
+entire GPUs to them. To optimize your compute resource usage, you can partition GPUs into slices: a single GPU device
+can run multiple isolated workloads on separate slices that do not impact each other, with each workload using only
+the fraction of GPU memory allocated to it.
+
+ClearML provides several GPU slicing options to optimize compute resource utilization:
+* [Container-based Memory Limits](#container-based-memory-limits): Use pre-packaged containers with built-in memory
+limits to run multiple containers on the same GPU (**Available as part of the ClearML open source offering**)
+* [Kubernetes-based Static MIG Slicing](#kubernetes-static-mig-fractions): Set up Kubernetes support for NVIDIA MIG
+(Multi-Instance GPU) to define GPU fractions for specific workloads (**Available as part of the ClearML open source offering**)
+* Dynamic GPU Slicing: On-demand GPU slicing per task for both MIG and non-MIG devices (**Available under the ClearML Enterprise plan**):
+  * [Bare Metal deployment](#bare-metal-deployment)
+  * [Kubernetes deployment](#kubernetes-deployment)
+
+### Container-based Memory Limits
+Use [`clearml-fractional-gpu`](https://github.com/allegroai/clearml-fractional-gpu)'s pre-packaged containers with
+built-in hard memory limits. Workloads running in these containers can only use up to the container's memory limit.
+Multiple isolated workloads can run on the same GPU without impacting each other.
+
+![Fractional GPU diagram](img/fractional_gpu_diagram.png)
+
+#### Usage
+
+##### Manual Execution
+
+1. Choose the container with the appropriate memory limit. ClearML supports CUDA 11.x and CUDA 12.x with memory limits
+ranging from 2 GB to 12 GB (see the [clearml-fractional-gpu repository](https://github.com/allegroai/clearml-fractional-gpu/blob/main/README.md#-containers) for the full list).
+1. Launch the container:
+
+   ```bash
+   docker run -it --gpus 0 --ipc=host --pid=host clearml/fractional-gpu:u22-cu12.3-8gb bash
+   ```
+
+   This example runs the ClearML Ubuntu 22 with CUDA 12.3 container on GPU 0, which is limited to using up to 8 GB of its memory.
+
+   :::note
+   `--pid=host` is required to allow the driver to differentiate between the container's processes and other host processes when limiting memory usage.
+   :::
+1. Run the following command inside the container to verify that the fractional GPU memory limit is working correctly:
+   ```bash
+   nvidia-smi
+   ```
+   Here is the expected output for the previous 8 GB-limited example on an A100:
+   ```bash
+   +---------------------------------------------------------------------------------------+
+   | NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
+   |-----------------------------------------+----------------------+----------------------+
+   | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
+   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
+   |                                         |                      |               MIG M. |
+   |=========================================+======================+======================|
+   |   0  A100-PCIE-40GB                 Off | 00000000:01:00.0 Off |                  N/A |
+   | 32%   33C    P0              66W / 250W |      0MiB /  8128MiB |      3%      Default |
+   |                                         |                      |             Disabled |
+   +-----------------------------------------+----------------------+----------------------+
+
+   +---------------------------------------------------------------------------------------+
+   | Processes:                                                                             |
+   |  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
+   |        ID   ID                                                              Usage      |
+   |=======================================================================================|
+   +---------------------------------------------------------------------------------------+
+   ```
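+
+You can also verify the limit from within your code. The following is a minimal sketch using PyTorch (an assumption for
+illustration; install `torch` in the container first if it is not already present):
+
+```bash
+# Prints (free, total) GPU memory in bytes; inside the 8 GB-limited container,
+# "total" reflects the container's memory limit rather than the physical device capacity.
+python3 -c 'import torch; print(torch.cuda.mem_get_info())'
+```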
+
+##### Remote Execution
+
+You can set up a ClearML Agent to execute tasks in a fractional GPU container by setting the agent’s default container
+via the command line. For example, all tasks pulled from the `default` queue by this agent will be executed in the
+Ubuntu 22 with CUDA 12.3 container, which is limited to using up to 8 GB of its memory:
+
+```bash
+clearml-agent daemon --queue default --docker clearml/fractional-gpu:u22-cu12.3-8gb
+```
+
+The agent’s default container can be overridden via the UI:
+1. Clone the task
+1. Set the Docker image in the cloned task's **Execution** tab > **Container** section
+
+   ![Task container](img/fractional_gpu_task_container.png)
+
+1. Enqueue the cloned task
+
+The task will be executed in the container specified in the UI.
+
+For more information, see [Docker Mode](#docker-mode).
+
+##### Fractional GPU Containers on Kubernetes
+Fractional GPU containers can be used to limit the memory consumption of your Kubernetes Job/Pod, letting multiple
+containers share GPU devices without interfering with each other.
+
+For example, the following configures a K8s pod to run using the `clearml/fractional-gpu:u22-cu12.3-8gb` container,
+which limits the pod to 8 GB of the GPU's memory:
+```
+apiVersion: v1
+kind: Pod
+metadata:
+  name: train-pod
+  labels:
+    app: trainme
+spec:
+  hostPID: true
+  containers:
+  - name: train-container
+    image: clearml/fractional-gpu:u22-cu12.3-8gb
+    command: ['python3', '-c', 'import torch; print(f"Free GPU Memory: (free, total) {torch.cuda.mem_get_info()}")']
+```
+
+:::note
+`hostPID: true` is required to allow the driver to differentiate between the pod's processes and other host processes
+when limiting memory usage.
+:::
+
+#### Custom Container
+Build your own custom fractional GPU container by inheriting from one of ClearML's containers. In your Dockerfile, make
+sure to include `FROM <clearml container image>` so your image inherits from the relevant container.
+
+See example custom Dockerfiles in the [clearml-fractional-gpu repository](https://github.com/allegroai/clearml-fractional-gpu/tree/main/examples).
+
+### Kubernetes Static MIG Fractions
+Set up NVIDIA MIG (Multi-Instance GPU) support for Kubernetes to define GPU fraction profiles for specific workloads
+through your NVIDIA device plugin.
+
+The ClearML Agent Helm chart lets you specify a pod template for each queue, which describes the resources that the pod
+will use. The template should specify the requested GPU slices under `spec.containers.resources.limits` to have the pods
+use the defined resources. For example, the following configures a K8s pod to run a `3g.20gb` MIG device:
+```
+# tf-benchmarks-mixed.yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: tf-benchmarks-mixed
+spec:
+  restartPolicy: Never
+  containers:
+  - name: tf-benchmarks-mixed
+    image: ""
+    command: []
+    args: []
+    resources:
+      limits:
+        nvidia.com/mig-3g.20gb: 1
+  nodeSelector: # optional
+    nvidia.com/gpu.product: A100-SXM4-40GB
+```
+
+When tasks are added to the relevant queue, the agent pulls the task and creates a pod to execute it, using the
+specified GPU slice.
+
+For example, the following configures tasks from the `default` queue to use `1g.5gb` MIG slices:
+```
+agentk8sglue:
+  queue: default
+  # …
+  basePodTemplate:
+    # …
+    resources:
+      limits:
+        nvidia.com/gpu: 1
+    nodeSelector:
+      nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
+```
+
+### Dynamic GPU Fractions
+
+:::important Enterprise Feature
+Dynamic GPU slicing is available under the ClearML Enterprise plan.
+:::
+
+ClearML dynamic GPU fractions provide on-the-fly, per-task GPU slicing, without having to set up containers or
+pre-configure tasks with memory limits.
+Specify a GPU fraction for a queue in the agent invocation, and every task the agent pulls from the queue will run in
+a container with the specified limit. This way you can safely run multiple tasks simultaneously without worrying that
+one task will use all of the GPU's memory.
+
+You can dynamically slice GPUs on [bare metal](#bare-metal-deployment) or on [Kubernetes](#kubernetes-deployment), for
+both MIG-enabled and non-MIG devices.
+
+#### Bare Metal Deployment
+1. Install the required packages:
+
+   ```bash
+   pip install clearml-agent clearml-agent-fractional-gpu
+   ```
+1. Start the ClearML Agent with dynamic GPU allocation. Use `--gpus` to specify the active GPUs, and use the `--queue`
+   flag to specify the queue name(s) and the number (or fraction) of GPUs to allocate to them:
+
+   ```bash
+   clearml-agent daemon --dynamic-gpus --gpus 0,1 --queue half_gpu=0.5
+   ```
+
+The agent can utilize 2 GPUs (GPUs 0 and 1). Every task enqueued to the `half_gpu` queue will be run by the agent and
+allocated only 50% of a GPU's memory (i.e. up to 4 tasks can run concurrently across the two GPUs).
+
+:::note
+You can allocate GPUs for a queue’s tasks by specifying either a fraction of a single GPU in increments as small as 0.125
+(e.g. 0.125, 0.25, 0.50, etc.) or whole GPUs (e.g. 1, 2, 4, etc.). However, you cannot specify fractions greater than
+one GPU (e.g. 1.25).
+:::
+
+You can set up multiple queues, each allocated a different number of GPUs per task. Note that the order in which the
+queues are listed sets their priority, so the agent will service tasks from the first listed queue before servicing
+subsequent queues:
+```bash
+clearml-agent daemon --dynamic-gpus --gpus 0-2 --queue dual_gpus=2 quarter_gpu=0.25 half_gpu=0.5 single_gpu=1
+```
+
+This agent will utilize 3 GPUs (GPUs 0, 1, and 2). The agent can spin up multiple jobs from the different queues based
+on the number of GPUs configured for each queue.
+
+##### Example Workflow
+Let’s say that four tasks are enqueued, one for each of the above queues (`dual_gpus`, `quarter_gpu`, `half_gpu`,
+`single_gpu`). The agent first pulls the task from the `dual_gpus` queue, since it is listed first, and runs it using
+2 GPUs. It next runs the tasks from `quarter_gpu` and `half_gpu`, both of which run on the remaining available GPU.
+This leaves the task in the `single_gpu` queue: 2.75 of the 3 GPUs are currently in use, so this task will only be
+pulled and run once a full GPU becomes available.
+
+#### Kubernetes Deployment
+
+ClearML supports fractional GPUs on Kubernetes through custom Enterprise Helm charts for both MIG and non-MIG devices:
+* `clearml-dynamic-mig-operator` for [MIG devices](#mig-enabled-gpus)
+* `clearml-fractional-gpu-injector` for [non-MIG devices](#non-mig-devices)
+
+For either setup, define the resource requirements of tasks sent to each queue in your Enterprise ClearML Agent Helm
+chart. When a task is enqueued in ClearML, it translates into a Kubernetes pod running on the designated device with
+the specified fractional resource, as defined in the Agent Helm chart.
+
+##### MIG-enabled GPUs
+The **ClearML Dynamic MIG Operator** (CDMO) chart enables running AI workloads on K8s with optimized hardware utilization
+and workload performance by facilitating MIG GPU partitioning. Make sure you have a [MIG-capable GPU](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus).
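+
+To quickly check whether a node's GPUs are MIG capable and whether MIG mode is currently enabled, you can query
+`nvidia-smi` (a sanity check; the exact fields available depend on your driver version):
+
+```
+nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv
+```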
+
+###### Prepare Cluster
+* Install the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator):
+
+  ```
+  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
+  helm repo update
+
+  helm install -n gpu-operator \
+       gpu-operator \
+       nvidia/gpu-operator \
+       --create-namespace \
+       --set migManager.enabled=false \
+       --set mig.strategy=mixed
+  ```
+* Enable MIG support:
+  1. Enable dynamic MIG support on your cluster by running the following command on all nodes used for training (run it
+     for each GPU ID in your cluster, where `<gpu_id>` is the GPU's index):
+
+     ```
+     nvidia-smi -i <gpu_id> -mig 1
+     ```
+  1. Reboot the node if required.
+  1. Add the following label to all nodes that will be used for training (where `<node_name>` is the node's name):
+
+     ```
+     kubectl label nodes <node_name> "cdmo.clear.ml/gpu-partitioning=mig"
+     ```
+
+###### Configure ClearML Queues
+The ClearML Enterprise plan supports K8S servicing multiple ClearML queues, as well as providing a pod template for each
+queue for describing the resources for each pod to use.
+
+In the `values.yaml` file, set the resource requirements of each ClearML queue. For example, the following configures
+what resources to use for the `default025` and the `default050` queues:
+```
+agentk8sglue:
+  queues:
+    default025:
+      templateOverrides:
+        labels:
+          required-resources: "0.25"
+        resources:
+          limits:
+            nvidia.com/mig-1g.10gb: 1
+    default050:
+      templateOverrides:
+        labels:
+          required-resources: "0.50"
+        resources:
+          limits:
+            nvidia.com/mig-1g.10gb: 1
+```
+
+##### Non-MIG Devices
+The **Fractional GPU Injector** chart enables running AI workloads on K8s in an optimized way, allowing you to use
+fractional GPUs on non-MIG devices.
+
+###### Requirements
+Install the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) through its Helm chart. Make sure `timeSlicing`
+is enabled.
+
+For example:
+```
+devicePlugin:
+  config:
+    name: device-plugin-config
+    create: true
+    default: "any"
+    data:
+      any: |-
+        version: v1
+        flags:
+          migStrategy: none
+        sharing:
+          timeSlicing:
+            renameByDefault: false
+            failRequestsGreaterThanOne: false
+            resources:
+              - name: nvidia.com/gpu
+                replicas: 4
+```
+
+The number of `replicas` is the maximum number of slices a GPU can be divided into.
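+
+Once the device plugin picks up this configuration, each physical GPU on the node should be advertised as multiple
+schedulable `nvidia.com/gpu` resources (4 per GPU in the example above). A quick way to verify this (`<node_name>` is a
+placeholder for an actual node name):
+
+```
+kubectl describe node <node_name> | grep nvidia.com/gpu
+```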
+
+###### Configure ClearML Queues
+In the `values.yaml` file, set the resource requirements of each ClearML queue. When a task is enqueued to the queue,
+it translates into a Kubernetes pod running on the designated device with the specified resource slice. The queues must
+be configured with specific labels and annotations. For example, the following configures the `default0500` queue to use
+50% of a GPU and the `default0250` queue to use 25% of a GPU:
+```
+agentk8sglue:
+  queues:
+    default0500:
+      templateOverrides:
+        labels:
+          required-resources: "0.5"
+          clearml-injector/fraction: "0.500"
+        resources:
+          limits:
+            nvidia.com/gpu: 1
+            clear.ml/fraction-1: "0.5"
+      queueSettings:
+        maxPods: 10
+    default0250:
+      templateOverrides:
+        labels:
+          required-resources: "0.25"
+          clearml-injector/fraction: "0.250"
+        resources:
+          limits:
+            nvidia.com/gpu: 1
+            clear.ml/fraction-1: "0.25"
+      queueSettings:
+        maxPods: 10
+```
+If a pod has a label matching the pattern `clearml-injector/fraction: "<gpu_fraction_value>"`, the injector will
+configure that pod to utilize the specified fraction of the GPU:
+```
+labels:
+  clearml-injector/fraction: "<gpu_fraction_value>"
+```
+Where `<gpu_fraction_value>` must be set to one of the following values:
+* "0.125"
+* "0.250"
+* "0.375"
+* "0.500"
+* "0.625"
+* "0.750"
+* "0.875"
+
 ## Services Mode
 ClearML Agent supports a **Services Mode** where, as soon as a task is launched off of its queue, the agent moves on to the next task
 without waiting for the previous one to complete. This mode is intended for running resource-sparse tasks that
diff --git a/docs/img/fractional_gpu_diagram.png b/docs/img/fractional_gpu_diagram.png
new file mode 100644
index 00000000..9df2c903
Binary files /dev/null and b/docs/img/fractional_gpu_diagram.png differ
diff --git a/docs/img/fractional_gpu_task_container.png b/docs/img/fractional_gpu_task_container.png
new file mode 100644
index 00000000..424ce5a6
Binary files /dev/null and b/docs/img/fractional_gpu_task_container.png differ