mirror of
https://github.com/clearml/clearml-fractional-gpu
synced 2025-01-30 18:36:58 +00:00
Initial example
This commit is contained in:
parent
e2deeb59d5
commit
f036004e3e
58
LICENSE.md
Normal file
---
title: Personal, Research, Development & Educational License Agreement
---

# Personal, Research, Development & Educational License Agreement

PLEASE CAREFULLY REVIEW THE FOLLOWING TERMS AND CONDITIONS BEFORE DOWNLOADING AND USING THE LICENSED MATERIALS. THIS LICENSE AGREEMENT (“AGREEMENT”) IS A LEGAL AGREEMENT BETWEEN YOU (EITHER A SINGLE INDIVIDUAL, OR A SINGLE LEGAL ENTITY) (“YOU”) AND CLEARML (“CLEARML”) COVERING THE CLEARML FRACTIONAL GPU PRODUCT.

By downloading and/or using or installing products from ClearML you automatically agree to and are bound by the terms and conditions of this agreement.

PLEASE NOTE THAT THIS AGREEMENT IS INTENDED FOR NON-COMMERCIAL USE OF THE PRODUCT. IF YOU INTEND TO USE CLEARML PRODUCTS FOR COMMERCIAL PURPOSES, THEN PLEASE CONTACT sales@clear.ml TO ARRANGE AN AGREEMENT WITH US BASED ON OUR COMMERCIAL LICENSE TERMS.

## 1. DEFINITIONS

“Intellectual Property” means any or all of the following and all rights in, arising out of, or associated with:

1. all inventions (whether patentable or not), invention disclosures, improvements, trade secrets, proprietary information, know-how, technology, algorithms, techniques, methods, devices, technical data, customer lists, and all documentation embodying or evidencing any of the foregoing;
2. all computer software, source codes, object codes, firmware, development tools, files, records, data, and all media on which any of the foregoing is recorded.

“Product” means the software provided by ClearML.

“You” means the opposite contract party: the party to whom an offer is made by ClearML, with whom an agreement is concluded by ClearML, or to whom the Product is supplied.

## 2. LICENSE TO USE

ClearML hereby grants you the following limited, non-exclusive, non-transferable, no-charge, and royalty-free licenses to use, modify, and distribute the Product, provided you do so for non-commercial (personal, educational, research and development, demonstration) purposes:

1. Copyright license;
2. Patent license, where such license only applies to those patent claims licensable by ClearML.

Specifically, you are allowed to:

1. Use the Product in your design to create, simulate, implement, manufacture, research, or build any software or hardware product, as long as you do not do so to make a profit directly from it.
2. Distribute the Product, provided the original disclaimer and copyright notice are retained, this Agreement is part of the distribution, and the Product is not part of your own product or service.

## 3. OWNERSHIP

The Product, its documentation, and any associated material are owned by ClearML and protected by copyright and other intellectual property laws.

Any modification or addition to the Product, documentation, and any associated materials or derivatives thereof, that You intentionally submit to ClearML for inclusion in the Product will become part of the Product and thus owned and copyrighted by ClearML.

By submitting any material for inclusion you waive any ownership, copyright, and patent rights and claims for the use of the submitted material in the Product. “Submitting” means any form of electronic, verbal, or written communication sent to ClearML or its representatives, including, but not limited to, email, mailing lists, source repositories, and issue tracking systems for the purpose of discussing and improving the Product.

You shall not remove any copyright, disclaimers, or other notices from any parts of the Product.

## 4. RIGHT OF EQUITABLE RELIEF

You acknowledge and agree that violation of this agreement may cause ClearML irreparable injury for which an adequate remedy at law may not be available. Therefore, ClearML shall be entitled to seek all remedies that may be available under equity, including immediate injunctive relief, in addition to whatever remedies may be available at law.

## 5. DISCLAIMER OF WARRANTY

The Product is provided “AS IS”. ClearML has no obligation to provide maintenance or support services in connection with the Product.

CLEARML DISCLAIMS ALL WARRANTIES, CONDITIONS AND REPRESENTATIONS, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING, BUT NOT LIMITED TO, THOSE RELATED TO MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, SATISFACTORY QUALITY, ACCURACY OR COMPLETENESS OR RESULTS, CONFORMANCE WITH DESCRIPTION, AND NON-INFRINGEMENT.

## 6. LIMITATION OF LIABILITY

TO THE MAXIMUM EXTENT PERMITTED BY LAW, IN NO EVENT SHALL CLEARML BE LIABLE TO YOU OR ANY THIRD PARTY FOR ANY INDIRECT, SPECIAL, CONSEQUENTIAL OR INCIDENTAL DAMAGES WHATSOEVER (INCLUDING, BUT NOT LIMITED TO, DAMAGES FOR LOSS OF PROFIT, BUSINESS INTERRUPTIONS OR LOSS OF INFORMATION) ARISING OUT OF THE USE OR INABILITY TO USE THE PRODUCT, WHETHER BASED ON A CLAIM UNDER CONTRACT, TORT OR OTHER LEGAL THEORY, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
189
README.md
# 🚀 🔥 Fractional GPU! ⚡ 📣

## Run multiple containers on the same GPU with driver-level memory limitation ✨ and compute time-slicing 🎊

## 🔰 Introduction

Sharing high-end GPUs, or even prosumer & consumer GPUs, between multiple users is the most cost-effective
way to accelerate AI development. Unfortunately, until now the
only available solution applied to MIG/slicing of high-end GPUs (A100+) and required Kubernetes. <br>
🔥 🎉 Welcome To Container-Based Fractional GPU For Any Nvidia Card! 🎉 🔥 <br>
We present pre-packaged containers supporting CUDA 11.x & CUDA 12.x with a **pre-built hard memory limitation!**
This means multiple containers can be launched on the same GPU, ensuring one user cannot allocate the entire host GPU memory!
(No more greedy processes grabbing the entire GPU memory! Finally we have a driver-level hard memory limit.)
## ⚡ Installation

Pick the container that works for you and launch it:
```bash
docker run -it --gpus device=0 --ipc=host --pid=host allegroai/clearml-fractional-gpu:u20.04-cu12.3-8gb bash
```

To verify that the fractional GPU memory limit is working correctly, run inside the container:
```bash
nvidia-smi
```
Here is an example output from an A100 GPU:

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  A100-PCIE-40GB               Off   | 00000000:01:00.0 Off |                  N/A |
| 32%   33C    P0              66W / 250W |       0MiB / 8128MiB |      3%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
```
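Note how the memory total reported above (`8128MiB`) reflects the container's 8 GB limit rather than the A100's full 40 GB. To script this check, one could parse the `nvidia-smi` output; the helper below is a sketch of ours, not part of the project, and parsing the human-readable table is a heuristic (`nvidia-smi --query-gpu=memory.total --format=csv` is the more robust scriptable route):

```python
# Sketch (our own helper, not part of clearml-fractional-gpu): extract the
# visible GPU memory total from nvidia-smi's table output to confirm the
# container-level limit (e.g. 8128 MiB inside an 8 GB container).
import re

def total_memory_mib(smi_output: str) -> int:
    """Return the total from the first 'used / total' memory column, in MiB."""
    match = re.search(r"\d+MiB\s*/\s*(\d+)MiB", smi_output)
    if match is None:
        raise ValueError("no memory-usage column found in nvidia-smi output")
    return int(match.group(1))

sample = "| 32%   33C    P0              66W / 250W |       0MiB / 8128MiB |"
print(total_memory_mib(sample))  # -> 8128
```

In a real container you would feed it `subprocess.check_output(["nvidia-smi"], text=True)` instead of the sample string.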
### Available Container Images

| Memory Limit | CUDA Ver | Ubuntu Ver | Docker Image |
|:-------------:|:--------:|:----------:|:----------------------------------------------------:|
| 8 GiB | 12.3 | 22.04 | `allegroai/clearml-fractional-gpu:u22.04-cu12.3-8gb` |
| 8 GiB | 12.3 | 20.04 | `allegroai/clearml-fractional-gpu:u20.04-cu12.3-8gb` |
| 8 GiB | 11.1 | 22.04 | `allegroai/clearml-fractional-gpu:u22.04-cu11.1-8gb` |
| 8 GiB | 11.1 | 20.04 | `allegroai/clearml-fractional-gpu:u20.04-cu11.1-8gb` |
| 4 GiB | 12.3 | 22.04 | `allegroai/clearml-fractional-gpu:u22.04-cu12.3-4gb` |
| 4 GiB | 12.3 | 20.04 | `allegroai/clearml-fractional-gpu:u20.04-cu12.3-4gb` |
| 4 GiB | 11.1 | 22.04 | `allegroai/clearml-fractional-gpu:u22.04-cu11.1-4gb` |
| 4 GiB | 11.1 | 20.04 | `allegroai/clearml-fractional-gpu:u20.04-cu11.1-4gb` |
| 2 GiB | 12.3 | 22.04 | `allegroai/clearml-fractional-gpu:u22.04-cu12.3-2gb` |
| 2 GiB | 12.3 | 20.04 | `allegroai/clearml-fractional-gpu:u20.04-cu12.3-2gb` |
| 2 GiB | 11.1 | 22.04 | `allegroai/clearml-fractional-gpu:u22.04-cu11.1-2gb` |
| 2 GiB | 11.1 | 20.04 | `allegroai/clearml-fractional-gpu:u20.04-cu11.1-2gb` |
| 1 GiB | 12.3 | 22.04 | `allegroai/clearml-fractional-gpu:u22.04-cu12.3-1gb` |
| 1 GiB | 12.3 | 20.04 | `allegroai/clearml-fractional-gpu:u20.04-cu12.3-1gb` |
| 1 GiB | 11.1 | 22.04 | `allegroai/clearml-fractional-gpu:u22.04-cu11.1-1gb` |
| 1 GiB | 11.1 | 20.04 | `allegroai/clearml-fractional-gpu:u20.04-cu11.1-1gb` |
> [!IMPORTANT]
>
> You must execute the container with `--pid=host`!

> [!NOTE]
>
> **`--pid=host`** is required to allow the driver to differentiate between the container's
> processes and other host processes when limiting memory / utilization usage.

> [!TIP]
>
> **[ClearML-Agent](https://clear.ml/docs/latest/docs/clearml_agent/) users: add `["--pid=host"]` to the `agent.extra_docker_arguments` section in your [config file](https://github.com/allegroai/clearml-agent/blob/c9fc092f4eea9c3890d582aa2a098c3c2f39ce72/docs/clearml.conf#L190)**
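All image tags in the table above follow one naming scheme: `u<ubuntu>-cu<cuda>-<mem>gb`. When selecting images programmatically (say, in a launcher script), a small helper can compose the tag; the function below is our own illustrative sketch, and its name and the validation tuples are not part of the project:

```python
# Sketch (our own helper, not part of clearml-fractional-gpu): compose a
# container image tag from the u<ubuntu>-cu<cuda>-<mem>gb naming scheme
# used in the table above.
REPO = "allegroai/clearml-fractional-gpu"
UBUNTU_VERSIONS = ("20.04", "22.04")
CUDA_VERSIONS = ("11.1", "12.3")
MEMORY_LIMITS_GB = (1, 2, 4, 8)

def fractional_gpu_image(ubuntu: str, cuda: str, mem_gb: int) -> str:
    """Return the Docker image tag for the given Ubuntu/CUDA/memory combo."""
    if ubuntu not in UBUNTU_VERSIONS:
        raise ValueError(f"unsupported Ubuntu version: {ubuntu}")
    if cuda not in CUDA_VERSIONS:
        raise ValueError(f"unsupported CUDA version: {cuda}")
    if mem_gb not in MEMORY_LIMITS_GB:
        raise ValueError(f"unsupported memory limit: {mem_gb}")
    return f"{REPO}:u{ubuntu}-cu{cuda}-{mem_gb}gb"

print(fractional_gpu_image("22.04", "12.3", 8))
# -> allegroai/clearml-fractional-gpu:u22.04-cu12.3-8gb
```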
## 🔩 Customization

Build your own containers by inheriting from the original containers.

You can find a few examples [here](https://github.com/allegroai/clearml-fractional-gpu/examples).
## 🌸 Implications

Our fractional GPU containers can be used in bare-metal executions as well as Kubernetes PODs.
Yes! By using one of our fractional GPU containers you can limit the memory consumption of your Job/Pod and
easily share GPUs without fear that workloads will crash one another by exhausting memory!

Here's a simple Kubernetes POD template:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-pod
  labels:
    app: trainme
spec:
  hostPID: true
  containers:
    - name: train-container
      image: allegroai/clearml-fractional-gpu:u22.04-cu12.3-8gb
      command: ['python3', '-c', 'import torch; print(f"Free GPU Memory: (free, global) {torch.cuda.mem_get_info()}")']
```

> [!IMPORTANT]
>
> You must execute the pod with `hostPID: true`!

> [!NOTE]
>
> **`hostPID: true`** is required to allow the driver to differentiate between the pod's
> processes and other host processes when limiting memory / utilization usage.
## 🔌 Support & Limitations

The containers support Nvidia drivers <= `545.x.x`.
We will keep updating & supporting new drivers as they are released.

**Supported GPUs**: GeForce 10, 20, 30, and 40 series, RTX A series, and data-center P100, A100, A10/A40, L40/s, H100
## ❓ FAQ

- **Q**: Will running `nvidia-smi` inside the container report the local processes' GPU consumption? <br>
**A**: Yes, `nvidia-smi` communicates directly with the low-level drivers and reports both accurate container GPU memory usage and the container-local memory limitation. <br>
Notice that GPU utilization will be the global (i.e., host-side) GPU utilization and not the specific local container GPU utilization.

- **Q**: How do I make sure my Python / PyTorch / TensorFlow process is actually memory-limited? <br>
**A**: For PyTorch you can run:
```python
import torch
print(f'Free GPU Memory: (free, global) {torch.cuda.mem_get_info()}')
```
Numba example:
```python
from numba import cuda
print(f'Free GPU Memory: {cuda.current_context().get_memory_info()}')
```

- **Q**: Can the limitation be broken by a user? <br>
**A**: We are sure a malicious user will find a way. It was never our intention to protect against malicious users;
if you have a malicious user with access to your machines, fractional GPUs are not your number one problem 😃

- **Q**: How can I programmatically detect the memory limitation? <br>
**A**: You can check the OS environment variable `GPU_MEM_LIMIT_GB`.
Notice that changing it will not remove or modify the limitation.

- **Q**: Is running the container **with** `--pid=host` secure / safe? <br>
**A**: It should be both secure and safe. The main caveat from a security perspective is that
a container process can see any command line running on the host system.
If a process command line contains a "secret", then yes, this might become a potential data leak.
Notice that passing "secrets" on the command line is ill-advised, and hence we do not consider it a security risk.
That said, if security is key, the enterprise edition (see below) eliminates the need to run with `--pid=host` and is thus fully secure.

- **Q**: Can you run the container **without** `--pid=host`? <br>
**A**: You can! But you will have to use the enterprise version of the clearml-fractional-gpu container
(otherwise the memory limit is applied system-wide instead of container-wide). If this feature is important to you, please contact [ClearML sales & support](https://clear.ml/contact-us).
## 📄 License

Usage license is granted for **personal**, **research**, **development**, or **educational** purposes only.

A commercial license is available as part of the [ClearML commercial solution](https://clear.ml).

## 🤖 Commercial & Enterprise version

ClearML offers enterprise and commercial licenses adding many additional features on top of fractional GPUs;
these include orchestration, priority queues, quota management, compute cluster dashboard,
dataset management & experiment management, as well as enterprise-grade security and support.
Learn more about [ClearML Orchestration](https://clear.ml) or talk to us directly at [ClearML sales](https://clear.ml/contact-us).

## 📡 How can I help?

Tell everyone about it! #ClearMLFractionalGPU
Join our [Slack Channel](https://joinslack.clear.ml/)
Tell us when things are not working, and help us debug them on the [Issues Page](https://github.com/allegroai/clearml-fractional-gpu/issues)

## 🌟 Credits

This product is brought to you by the ClearML team with ❤️
5
examples/Dockerfile
Normal file
FROM allegroai/clearml-fractional-gpu:u22.04-cu12.3-8gb

# upgrade torch to the latest version
RUN pip3 install -U torch torchvision torchaudio torchdata torchmetrics torchrec torchtext