Commit Graph

12 Commits

Author SHA1 Message Date
moningchen
5ab80c28f3 In the Internode Normal Kernel, RDMA data transfers through the NVSHMEM IBRC transport use a single QP between each pair of GPUs, which limits kernel performance with dual-port NICs and in RoCE networks.
The optimized Internode Normal Kernel uses multiple QPs between each pair of GPUs, assigning a dedicated QP to each channel, and switches the transport from IBRC to IBGDA (see the configuration sketch after this entry).

With these optimizations, the Internode Normal Kernel reaches near-optimal performance on both H800 and H20, with RDMA throughput approaching the physical limit of the network. Measured with the current default benchmarking method on 4-node H800 and H20 clusters, RDMA throughput exceeds 60 GB/s.
2025-04-21 15:50:39 +08:00
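A minimal sketch of how the change above might be configured, assuming the switch to IBGDA and the per-channel QPs are controlled through NVSHMEM environment variables set before the communication buffer is created. The variable names follow NVSHMEM documentation and the NUM_CHANNELS value is illustrative; neither is taken from this commit, so treat the exact names and values as assumptions.

```python
# Hypothetical configuration sketch, not the commit's actual code.
# Exact NVSHMEM variable names/values may differ between NVSHMEM versions.
import os

NUM_CHANNELS = 24  # illustrative; one QP per communication channel

# Switch the remote transport from IBRC to IBGDA (GPU-initiated data access).
os.environ["NVSHMEM_IB_ENABLE_IBGDA"] = "1"
# Let the GPU post RDMA work requests directly.
os.environ["NVSHMEM_IBGDA_NIC_HANDLER"] = "gpu"
# Multiple RC QPs per peer, so channels no longer serialize on a single QP.
os.environ["NVSHMEM_IBGDA_NUM_RC_PER_PE"] = str(NUM_CHANNELS)
# Relaxed ordering for RDMA writes (see commit 3885404ffb below).
os.environ["NVSHMEM_IB_ENABLE_RELAXED_ORDERING"] = "1"

# The environment must be in place before the NVSHMEM-backed buffer is built, e.g.:
# import deep_ep
# buffer = deep_ep.Buffer(group, num_nvl_bytes, num_rdma_bytes)
```

Giving each channel its own QP lets concurrent channels issue RDMA writes on independent queue pairs, which is what allows dual-port NICs and RoCE fabrics to be driven closer to their physical limit.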
fzyzcjy
218c5a1f96 Update buffer.py 2025-04-03 10:57:45 +08:00
fzyzcjy
36b5c27993 Update buffer.py 2025-03-25 09:12:36 +08:00
Chenggang Zhao
dcaf73e5ff Support zero-copy for low-latency combine 2025-03-18 15:41:50 +08:00
Dmytro Dzhulgakov
50ac280ae7 comments 2025-03-13 00:42:08 +00:00
Dmytro Dzhulgakov
b3b61ef5ef Allow passing output tensor in low_latency_combine 2025-03-10 22:19:21 +00:00
Chenggang Zhao
ed7487c15e Support BF16 for low-latency kernels 2025-03-10 17:24:41 +08:00
Chenggang Zhao
458cdcb22a Fix AR bugs for normal kernels 2025-03-05 17:13:35 +08:00
Chenggang Zhao
1553fc42bf Improve EP2/4 performance 2025-03-04 15:34:33 +08:00
Chenggang Zhao
2a3cac903a Add some docs 2025-03-04 10:19:42 +08:00
Chenggang Zhao
3885404ffb Add NVSHMEM_IB_ENABLE_RELAXED_ORDERING 2025-02-26 17:54:12 +08:00
Chenggang Zhao
ebfe47e46f Initial commit 2025-02-25 09:07:53 +08:00