From 07245dbf855936dce8dcec6977d3c0dc255b823d Mon Sep 17 00:00:00 2001
From: Yineng Zhang
Date: Sun, 22 Sep 2024 22:00:47 +0800
Subject: [PATCH] doc: recommend SGLang for DeepSeek V2 inference

blog: https://lmsys.org/blog/2024-09-04-sglang-v0-3/
slides: https://docs.google.com/presentation/d/1wB_Ul0LZwIDL47qFl64b8hVhH1_ya-1YPAPSSv0cKMs
---
 README.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/README.md b/README.md
index 0c67025..ed6ff1d 100644
--- a/README.md
+++ b/README.md
@@ -293,6 +293,23 @@ Assistant: {assistant_message_1}<|end▁of▁sentence|>User: {user_message_2}
 
 Assistant:
 ```
+### Inference with SGLang (recommended)
+
+[SGLang](https://github.com/sgl-project/sglang) currently supports MLA, FP8 (W8A8), FP8 KV Cache, CUDA Graph, and Torch Compile, offering the best performance among open-source frameworks. Here are some example launch commands:
+
+```bash
+# fp16 tp8
+python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code
+
+# fp16 tp8 w/ torch compile
+python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code --enable-torch-compile
+
+# fp16 tp8 w/ torch compile, max torch compile batch size 1
+python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code --enable-torch-compile --max-torch-compile-bs 1
+
+# fp8 tp8 w/ torch compile, fp8 e5m2 kv cache
+python3 -m sglang.launch_server --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --tp 8 --trust-remote-code --enable-torch-compile --kv-cache-dtype fp8_e5m2
+```
 ### Inference with vLLM (recommended)
 
 To utilize [vLLM](https://github.com/vllm-project/vllm) for model inference, please merge this Pull Request into your vLLM codebase: https://github.com/vllm-project/vllm/pull/4650.
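Once one of the servers above is running, it can be queried through SGLang's OpenAI-compatible API. The sketch below is illustrative, assuming the default listen address of `http://localhost:30000` and the FP16 launch command from the first example; the host, port, model path, prompt, and sampling parameters are placeholders to adjust for your deployment.

```bash
# Minimal chat-completions request against the SGLang server started above.
# Assumes the default --port 30000 on localhost; change the URL and model
# path if you launched the server with different settings.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-Coder-V2-Instruct",
    "messages": [
      {"role": "user", "content": "Write a quicksort function in Python."}
    ],
    "max_tokens": 256,
    "temperature": 0.2
  }'
```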