DeepEP

mirror of https://github.com/deepseek-ai/DeepEP synced 2025-06-26 18:28:11 +00:00

Author	SHA1	Message	Date
Chenggang Zhao	74f4ef7b22	Remove the low-latency usage flag	2025-06-16 13:28:24 +08:00
Zhicheng Wu	05df5554ff	Use one QP per SM for internode normal kernels (#181): let the sender SM use channel_id and the receiver SM use channel_id + num_channels	2025-06-13 14:37:59 +08:00
Shifang Xu	21efbe9b48	Support UE8M0 data format (#206)	2025-06-12 09:38:19 +08:00
Chenggang Zhao	b8d90fb753	Support Ampere architecture (#204): Update README; Update `setup.py`; Fix headers; Add `DISABLE_NVSHMEM` for APIs; Fix launch; Fix TMA settings; Fix TMA usages; Fix dlink; Separate layout kernels; Update version; Add `is_sm90_compiled`; Fix tests; Add NVLink connection checks; Update README; Fix tests; Add some comments; Minor fix; Minor fix; Fix bugs	2025-06-11 15:48:18 +08:00
Chenggang Zhao	dd13c7145c	Check the empty list	2025-06-11 11:14:30 +08:00
Chenggang Zhao	a8299ca7c2	Support CUDA graph for intranode normal kernels (#203)	2025-06-11 11:08:54 +08:00
Chenggang Zhao	5a2e37fa28	Support statistics tensor for low-latency kernels (#196)	2025-06-09 15:50:56 +08:00
Chenggang Zhao	0d1a855d81	Add low-latency kernel PCIe usage flag (#195): Add low-latency kernel usage flag; Update comments	2025-06-09 14:37:13 +08:00
Chenggang Zhao	c8dceba110	Use TMA instead of LD/ST for intra-node normal kernels (#191): Update CMake files; Use TMA instead of LD/ST for intranode dispatch; Use TMA instead of LD/ST for intranode combine; Adjust configs; Test default configs as well; More warps for combine; Add inter-thread fence; Enable more warps; Do not use TMA for senders; Update configs; Remove useless wait	2025-06-06 15:40:17 +08:00
Shangyan Zhou	20b2aaaf9e	Refactor some code.	2025-04-22 10:22:30 +08:00
moningchen	5ab80c28f3	In the internode normal kernel, NVSHMEM IBRC used a single QP for RDMA data transfer between two GPUs, which limits kernel performance on dual-port NICs and in RoCE networks. The optimized internode normal kernel uses multiple QPs between two GPUs, with a different QP for each channel, and switches the transport from IBRC to IBGDA. With these optimizations, RDMA transmission performance nearly reaches the physical network limit in both H800 and H20 environments. Using the current default statistical method, RDMA performance reaches 60 GB/s+ in 4-node H800 and H20 environments.	2025-04-21 15:50:39 +08:00
Hao Lin	23c54150ba	Fix test combine args Signed-off-by: Hao Lin <linhaomails@gmail.com>	2025-04-11 18:21:09 +08:00
fujianhao.fjh	0f80da8458	fix: results not output on some Linux systems	2025-04-10 18:18:30 +08:00
Chenggang Zhao	42494864ba	Remove useless control metadata for low-latency combine	2025-04-07 09:55:39 +08:00
Chenggang Zhao	26fa72d80f	Fix zero-copy mode tests	2025-03-28 16:49:33 +08:00
Chenggang Zhao	ae0eafd2be	Remove confusing comments	2025-03-25 09:27:34 +08:00
Chenggang Zhao	dcaf73e5ff	Support zero-copy for low-latency combine	2025-03-18 15:41:50 +08:00
Dmytro Dzhulgakov	b3b61ef5ef	Allow passing output tensor in low_latency_combine	2025-03-10 22:19:21 +00:00
Chenggang Zhao	ed7487c15e	Support BF16 for low-latency kernels	2025-03-10 17:24:41 +08:00
Chenggang Zhao	458cdcb22a	Fix AR bugs for normal kernels	2025-03-05 17:13:35 +08:00
Chenggang Zhao	1553fc42bf	Improve EP2/4 performance	2025-03-04 15:34:33 +08:00
Chenggang Zhao	c5b4040502	Enable intranode kernel tests with EP2 and EP4	2025-03-03 15:01:02 +08:00
Chenggang Zhao	ebfe47e46f	Initial commit	2025-02-25 09:07:53 +08:00

23 Commits
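
The index mapping described in commit 05df5554ff (sender SM uses channel_id, receiver SM uses channel_id + num_channels) can be illustrated with a minimal Python sketch. This is not DeepEP's actual code: the function name `qp_index` and the `is_sender` parameter are hypothetical, and the sketch assumes channel IDs run from 0 to num_channels - 1.

```python
def qp_index(channel_id: int, num_channels: int, is_sender: bool) -> int:
    """Hypothetical helper: choose a QP index for the SM serving a channel.

    Assumption (from the commit message only): the sender SM of a channel
    uses `channel_id`, and the receiver SM uses `channel_id + num_channels`,
    so the two roles never share a QP.
    """
    if not 0 <= channel_id < num_channels:
        raise ValueError("channel_id must be in [0, num_channels)")
    return channel_id if is_sender else channel_id + num_channels


# Example with 4 channels: senders and receivers get disjoint QP ranges.
num_channels = 4
sender_qps = [qp_index(c, num_channels, True) for c in range(num_channels)]
receiver_qps = [qp_index(c, num_channels, False) for c in range(num_channels)]
```

Under this assumed scheme, 2 * num_channels QPs are needed in total, one per SM role per channel.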