From 20d316b940944a77ef914cc9bcabe604fe51dc69 Mon Sep 17 00:00:00 2001
From: stack-heap-overflow <37035235+stack-heap-overflow@users.noreply.github.com>
Date: Thu, 30 Nov 2023 15:27:01 +0800
Subject: [PATCH] Update README.md (#5)

---
 README.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/README.md b/README.md
index e0337e3..c1db533 100644
--- a/README.md
+++ b/README.md
@@ -324,6 +324,17 @@ python convert-hf-to-gguf.py --outfile --model-name dee

`UPDATE:` [exllamav2](https://github.com/turboderp/exllamav2) now supports the HuggingFace tokenizer. Please pull the latest version and try it out.

### GPU Memory Usage

We profile the peak memory usage of inference for the 7B and 67B models across a range of batch sizes and sequence lengths.

For DeepSeek LLM 7B, we use **1 NVIDIA A100-PCIE-40GB GPU** for inference.
| Batch Size \ Sequence Length | 256 | 512 | 1024 | 2048 | 4096 |
| :--: | :--: | :--: | :--: | :--: | :--: |
| 1  | 13.29 GB | 13.63 GB | 14.47 GB | 16.37 GB | 21.25 GB |
| 2  | 13.63 GB | 14.39 GB | 15.98 GB | 19.82 GB | 29.59 GB |
| 4  | 14.47 GB | 15.82 GB | 19.04 GB | 26.65 GB | OOM |
| 8  | 15.99 GB | 18.71 GB | 25.14 GB | 35.19 GB | OOM |
| 16 | 19.06 GB | 24.52 GB | 37.28 GB | OOM | OOM |
For DeepSeek LLM 67B, we use **8 NVIDIA A100-PCIE-40GB GPUs** for inference.
| Batch Size \ Sequence Length | 256 | 512 | 1024 | 2048 | 4096 |
| :--: | :--: | :--: | :--: | :--: | :--: |
| 1  | 16.92 GB | 17.11 GB | 17.66 GB | 20.01 GB | 33.23 GB |
| 2  | 17.04 GB | 17.28 GB | 18.55 GB | 25.27 GB | OOM |
| 4  | 17.20 GB | 17.80 GB | 21.28 GB | 33.71 GB | OOM |
| 8  | 17.59 GB | 19.25 GB | 25.69 GB | OOM | OOM |
| 16 | 18.17 GB | 21.69 GB | 34.54 GB | OOM | OOM |
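The tables above grow roughly linearly in batch size × sequence length once the KV cache dominates the fixed weight footprint. A minimal back-of-the-envelope sketch of that relationship, using illustrative numbers (the layer count, hidden size, and weight footprint below are placeholders, not the published DeepSeek LLM configs), and ignoring activations and allocator fragmentation, so it is only a lower bound:

```python
def estimate_inference_gb(batch_size: int, seq_len: int,
                          n_layers: int, hidden_size: int,
                          weight_gb: float, bytes_per_elem: int = 2) -> float:
    """Rough peak-memory estimate: static weights plus an fp16 KV cache.

    The KV cache stores one key and one value vector (hidden_size each)
    per token, per layer:
    2 * n_layers * batch_size * seq_len * hidden_size elements.
    """
    kv_bytes = 2 * n_layers * batch_size * seq_len * hidden_size * bytes_per_elem
    return weight_gb + kv_bytes / 1024**3

# Hypothetical 7B-like config: 32 layers, 4096 hidden, ~13 GB of fp16 weights.
print(round(estimate_inference_gb(1, 4096, 32, 4096, 13.0), 2))  # → 15.0
```

Doubling either the batch size or the sequence length doubles the KV-cache term, which matches the near-linear growth (and eventual OOM) seen across each row and column above.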
## 7. Limitation