DeepEP

mirror of https://github.com/deepseek-ai/DeepEP synced 2025-06-26 18:28:11 +00:00

Author	SHA1	Message	Date
Chenggang Zhao	dd13c7145c	Check the empty list	2025-06-11 11:14:30 +08:00
Chenggang Zhao	a8299ca7c2	Support CUDA graph for intranode normal kernels (#203)	2025-06-11 11:08:54 +08:00
Chenggang Zhao	5a2e37fa28	Support statistics tensor for low-latency kernels (#196)	2025-06-09 15:50:56 +08:00
Chenggang Zhao	0d1a855d81	Add low-latency kernel PCIe usage flag (#195) * Add low-latency kernel usage flag * Update comments	2025-06-09 14:37:13 +08:00
fzyzcjy	4cd951700e	more	2025-06-07 21:39:00 +08:00
Chenggang Zhao	c8dceba110	Use TMA instead of LD/ST for intra-node normal kernels (#191) * Update CMake files * Use TMA instead of LD/ST for intranode dispatch * Use TMA instead of LD/ST for intranode combine * Adjust configs * Test default configs as well * More warps for combine * Add inter-thread fence * Enable more warps * Do not use TMA for senders * Update configs * Remove useless wait	2025-06-06 15:40:17 +08:00
Shangyan Zhou	df4debe30c	Reduce NVSHMEM GPU memory usage and disable MNNVL. (#190) Co-authored-by: Shangyan Zhou <sy.zhou@deepseek.com>	2025-06-06 13:25:43 +08:00
Chenggang Zhao	aae9fa9a6d	Allow NVLink traffic for low-latency kernels by default	2025-05-23 20:14:50 +08:00
Chenggang Zhao	92405ddf30	Code cleanup and bug fixed	2025-05-23 11:14:16 +08:00
cywork121	68ae8b3d07	Feature: LL nvlink p2p (#173)	2025-05-23 10:37:45 +08:00
Chenggang Zhao	edbb1bc3ff	Several code lints	2025-04-22 10:52:10 +08:00
Shangyan Zhou	3e54b78fd7	Normal kernels always use IBGDA mode.	2025-04-22 10:36:24 +08:00
moningchen	5ab80c28f3	In the internode normal kernel, when NVSHMEM IBRC is used for RDMA data transmission, a single QP carries the data between two GPUs, which limits kernel performance in dual-port NIC and RoCE network scenarios. In our optimized internode normal kernel, we implemented multiple QPs for data transmission between two GPUs, setting a different QP for each channel. Additionally, we changed the transmission method from IBRC to IBGDA. With these optimizations, the internode normal kernel achieves optimal performance in both H800 and H20 environments, with RDMA transmission performance nearly reaching the physical network limit. Using the current default statistical method, in 4-node H800 and H20 environments, RDMA performance can reach 60+ GB/s.	2025-04-21 15:50:39 +08:00
fzyzcjy	218c5a1f96	Update buffer.py	2025-04-03 10:57:45 +08:00
fzyzcjy	36b5c27993	Update buffer.py	2025-03-25 09:12:36 +08:00
Chenggang Zhao	dcaf73e5ff	Support zero-copy for low-latency combine	2025-03-18 15:41:50 +08:00
Dmytro Dzhulgakov	50ac280ae7	comments	2025-03-13 00:42:08 +00:00
Dmytro Dzhulgakov	b3b61ef5ef	Allow passing output tensor in low_latency_combine	2025-03-10 22:19:21 +00:00
Chenggang Zhao	ed7487c15e	Support BF16 for low-latency kernels	2025-03-10 17:24:41 +08:00
Chenggang Zhao	458cdcb22a	Fix AR bugs for normal kernels	2025-03-05 17:13:35 +08:00
Chenggang Zhao	1553fc42bf	Improve EP2/4 performance	2025-03-04 15:34:33 +08:00
Chenggang Zhao	2a3cac903a	Add some docs	2025-03-04 10:19:42 +08:00
Chenggang Zhao	3885404ffb	Add `NVSHMEM_IB_ENABLE_RELAXED_ORDERING`	2025-02-26 17:54:12 +08:00
Chenggang Zhao	ebfe47e46f	Initial commit	2025-02-25 09:07:53 +08:00

24 Commits