Fix the performance data.

Shangyan Zhou 2025-04-22 11:23:42 +08:00
parent edbb1bc3ff
commit 3b1045db43

@@ -17,11 +17,11 @@ We test normal kernels on H800 (~160 GB/s NVLink maximum bandwidth), with each c
| Type | Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth |
|:---------:|:------------:|:--------------------:|:-----------:|:--------------------:|
| Intranode | 8 | 153 GB/s (NVLink) | 8 | 158 GB/s (NVLink) |
-| Internode | 16 | 47 GB/s (RDMA) | 16 | 62 GB/s (RDMA) |
-| Internode | 32 | 59 GB/s (RDMA) | 32 | 60 GB/s (RDMA) |
-| Internode | 64 | 49 GB/s (RDMA) | 64 | 51 GB/s (RDMA) |
+| Internode | 16 | 43 GB/s (RDMA) | 16 | 43 GB/s (RDMA) |
+| Internode | 32 | 58 GB/s (RDMA) | 32 | 57 GB/s (RDMA) |
+| Internode | 64 | 51 GB/s (RDMA) | 64 | 50 GB/s (RDMA) |
-**News (2025.04.22)**: the performance is optimized by 5-35% by Tencent Network Platform Department, see [#130](https://github.com/deepseek-ai/DeepEP/pull/130) for more details. Thanks for the contribution!
+**News (2025.04.22)**: with optimizations from Tencent Network Platform Department, performance was enhanced by up to 30%, see [#130](https://github.com/deepseek-ai/DeepEP/pull/130) for more details. Thanks for the contribution!
### Low-latency kernels with pure RDMA