## 🔰 Introduction

Sharing high-end GPUs, or even prosumer & consumer GPUs, between multiple users is the most cost-effective
way to accelerate AI development. Unfortunately, until now the only existing solution applied to
MIG/slicing high-end GPUs (A100+) and required Kubernetes. <br>

🔥 🎉 Welcome To Container Based Fractional GPU For Any Nvidia Card! 🎉 🔥 <br>

We present pre-packaged containers supporting CUDA 11.x & CUDA 12.x with a pre-built hard memory limitation!
This means multiple containers can be launched on the same GPU, ensuring one user cannot allocate the entire host GPU memory!
(No more greedy processes grabbing the entire GPU memory! Finally, we have a driver-level hard memory limit option.)

## 🚀 Offerings

ClearML offers several options to optimize GPU resource utilization by partitioning GPUs:

* [**Dynamic GPU Slicing**](https://clear.ml/docs/latest/docs/clearml_agent/clearml_agent_fractional_gpus#dynamic-gpu-fractions):
  On-demand GPU slicing per task for both MIG and non-MIG devices (available under the ClearML Enterprise plan):
  * [Bare Metal deployment](https://clear.ml/docs/latest/docs/clearml_agent/clearml_agent_fractional_gpus#bare-metal-deployment)
  * [Kubernetes deployment](https://clear.ml/docs/latest/docs/clearml_agent/clearml_agent_fractional_gpus#kubernetes-deployment)
* **Container-based Memory Limits** (**this repository**): Use pre-packaged containers with built-in memory limits to
  run multiple containers on the same GPU (available as part of the ClearML open source offering).
* [**Kubernetes-based Static MIG Slicing**](https://clear.ml/docs/latest/docs/clearml_agent/clearml_agent_fractional_gpus#kubernetes-static-mig-fractions):
  Set up Kubernetes support for NVIDIA MIG (Multi-Instance GPU) to define GPU
  fractions for specific workloads (available as part of the ClearML open source offering).

With these options, ClearML enables running AI workloads with optimized hardware utilization and workload performance.
This repository covers container-based fractional GPUs. For more information on ClearML's fractional GPU offerings, see
the [ClearML documentation](https://clear.ml/docs/latest/docs/clearml_agent/clearml_agent_fractional_gpus).

![Fractional GPU diagram](docs/fractional_gpu_diagram.png)

## ⚡ Installation

Pick the container that works for you and launch it:
```bash
docker run -it --gpus 0 --ipc=host --pid=host clearml/fractional-gpu:u22-cu12.3-8gb bash
```

To verify the fractional GPU memory limit is working correctly, run the following inside the container:
```bash
nvidia-smi
```
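
Because the limit is enforced per container, several limited containers can share one GPU. A minimal sketch (it reuses the same 8 GB image for both containers; substitute any published tag):
```bash
# launch two memory-limited containers on the same physical GPU;
# each one sees only its own hard memory cap, so neither can starve the other
docker run -d --gpus 0 --ipc=host --pid=host clearml/fractional-gpu:u22-cu12.3-8gb sleep infinity
docker run -d --gpus 0 --ipc=host --pid=host clearml/fractional-gpu:u22-cu12.3-8gb sleep infinity
```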

## 🔩 Customization

Build your own containers and inherit from the original containers.

You can find a few examples [here](https://github.com/allegroai/clearml-fractional-gpu/tree/main/examples).
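
For instance, a minimal sketch of a derived image (it assumes `pip3` is available in the base image; the tag and package are illustrative):
```bash
# write a Dockerfile that inherits from a fractional-gpu base image, then build it
cat > Dockerfile <<'EOF'
FROM clearml/fractional-gpu:u22-cu12.3-8gb
RUN pip3 install --no-cache-dir torch
EOF
docker build -t my-fractional-gpu:u22-cu12.3-8gb .
```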

## ☸ Kubernetes

Fractional GPU containers can be used on bare-metal executions as well as Kubernetes PODs.
Yes! By using one of the Fractional GPU containers you can limit the memory consumption of your Job/Pod and
easily share GPUs without fearing they will memory crash one another!

Here's a simple Kubernetes POD template (a minimal sketch; adapt the names, image tag, and resources to your cluster):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-pod
spec:
  # hostPID / hostIPC mirror the `--pid=host` / `--ipc=host` docker flags above
  hostPID: true
  hostIPC: true
  containers:
    - name: train-container
      image: clearml/fractional-gpu:u22-cu12.3-8gb
      command: ['sleep', 'infinity']
      resources:
        limits:
          nvidia.com/gpu: 1
```
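
To try it out, apply the manifest and check the limit from inside the pod (the manifest filename is illustrative):
```bash
kubectl apply -f fractional-gpu-pod.yaml
# should report the container's memory cap, not the full GPU memory
kubectl exec -it train-pod -- nvidia-smi
```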

## 🔌 Support & Limitations

The containers support Nvidia drivers <= `545.x.x`.
We will keep updating & supporting new drivers as they continue to be released.
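
To check the host driver version, you can use a standard `nvidia-smi` query:
```bash
# prints the installed Nvidia driver version; it should be <= 545.x.x
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```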

**Supported GPUs**: RTX series 10, 20, 30, 40, A series, and Data-Center P100, A100, A10/A40, L40/s, H100.

**Limitations**: Windows host machines are currently not supported. If this is important for you, leave a request in the [Issues](/issues) section.

## ❓ FAQ

- **Q**: Does `nvidia-smi` report the container's GPU memory limit correctly? <br>
**A**: Yes, `nvidia-smi` communicates directly with the low-level drivers and reports both the accurate container GPU memory and the container-local memory limitation. <br>
Notice GPU utilization will be the global (i.e. host-side) GPU utilization, not the specific local container's GPU utilization.

- **Q**: How do I make sure my Python / PyTorch / TensorFlow processes are actually memory limited? <br>
**A**: For PyTorch (and Numba) you can run: <br>
```python
import torch
from numba import cuda  # requires numba to be installed in the container

# (illustrative check) the reported totals should reflect the container's
# memory limit, not the physical GPU memory
print(f'Total GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB')
print(f'Free GPU Memory: {cuda.current_context().get_memory_info()}')
```

- **Q**: Can the limitation be broken by a user? <br>
**A**: We are sure a malicious user will find a way. It was never our intention to protect against malicious users. <br>
If you have a malicious user with access to your machines, fractional GPUs are not your number 1 problem 😃

- **Q**: How can I programmatically detect the memory limitation? <br>
**A**: You can check the OS environment variable `GPU_MEM_LIMIT_GB`. <br>
Notice that changing it will not remove or reduce the limitation.
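
For example, from inside a limited container:
```bash
# GPU_MEM_LIMIT_GB is set by the container; changing it does not affect the enforced limit
echo "GPU memory limit: ${GPU_MEM_LIMIT_GB} GB"
```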

- **Q**: Is running the container with `--pid=host` secure and safe? <br>
**A**: It should be both secure and safe. The main caveat from a security perspective is that
a container process can see any command line running on the host system.
If a process command line contains a "secret", then yes, this might become a potential data leak.
Notice that passing "secrets" on the command line is ill-advised, and hence we do not consider it a security risk.
That said, if security is key, the enterprise edition (see below) eliminates the need to run with `--pid=host` and is thus fully secure.

- **Q**: Can you run the container **without** `--pid=host`? <br>
**A**: You can! But you will have to use the enterprise version of the clearml-fractional-gpu container
(otherwise the memory limit is applied system-wide instead of container-wide). If this feature is important for you, please contact [ClearML sales & support](https://clear.ml/contact-us).

## 📄 License

ClearML offers enterprise and commercial licenses adding many additional features on top of fractional GPUs;
these include orchestration, priority queues, quota management, a compute cluster dashboard,
dataset management & experiment management, as well as enterprise-grade security and support.
Learn more about [ClearML Orchestration](https://clear.ml) or talk to us directly at [ClearML sales](https://clear.ml/contact-us).

## 📡 How can I help?

Tell everyone about it! #ClearMLFractionalGPU

Join our [Slack Channel](https://joinslack.clear.ml/)

Tell us when things are not working, and help us debug it on the [Issues Page](https://github.com/allegroai/clearml-fractional-gpu/issues)

## 🌟 Credits