Chenggang Zhao
8da2d7b38d
Fully remove barrier FIFO designs ( #200 )
...
* Fully remove FIFO slots
* Fully remove FIFO buffers
* Minor style fixes
* Fix some typos
* Bugs fixed
* Cleanup `ibgda_poll_cq`
2025-06-10 16:23:20 +08:00
Shangyan Zhou
a16af40531
Merge pull request #201 from youkaichao/no_gdrcopy
...
Remove the dependency on gdrcopy
2025-06-10 16:11:32 +08:00
youkaichao
b9b7ce348b
update readme
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-06-10 15:49:50 +08:00
youkaichao
97be5a3873
update the patch
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-06-10 15:39:44 +08:00
Chenggang Zhao
1157693c0c
Remove useless comments
2025-06-09 17:14:25 +08:00
Chenggang Zhao
5a2e37fa28
Support statistics tensor for low-latency kernels ( #196 )
2025-06-09 15:50:56 +08:00
Chenggang Zhao
0d1a855d81
Add low-latency kernel PCIe usage flag ( #195 )
...
* Add low-latency kernel usage flag
* Update comments
2025-06-09 14:37:13 +08:00
Chenggang Zhao
564e375234
Fix < PTX ISA 8.6 compatibility ( #194 )
2025-06-09 10:48:42 +08:00
Shangyan Zhou
11a0b0e1a3
Merge pull request #193 from fzyzcjy/feat/fix_mnnvl
...
Allow using MNNVL
2025-06-08 13:05:08 +08:00
fzyzcjy
4cd951700e
more
2025-06-07 21:39:00 +08:00
Chenggang Zhao
c8dceba110
Use TMA instead of LD/ST for intra-node normal kernels ( #191 )
...
* Update CMake files
* Use TMA instead of LD/ST for intranode dispatch
* Use TMA instead of LD/ST for intranode combine
* Adjust configs
* Test default configs as well
* More warps for combine
* Add inter-thread fence
* Enable more warps
* Do not use TMA for senders
* Update configs
* Remove useless wait
2025-06-06 15:40:17 +08:00
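A minimal sketch of the TMA-based copy idea behind commit c8dceba110 above, assuming hypothetical tile sizes and buffer names rather than DeepEP's actual kernels: on Hopper, `cuda::memcpy_async` with a block-scoped barrier can lower to the bulk-copy (TMA) path when addresses and sizes are 16-byte aligned, replacing per-thread LD/ST copies.
```cuda
#include <cuda/barrier>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Suppress the expected warning for the documented shared-memory barrier pattern.
#pragma nv_diag_suppress static_var_with_dynamic_init

// Sketch only: copy one tile per block from global to shared memory with a
// single bulk asynchronous copy, then consume it. Launch with enough dynamic
// shared memory for `num_int4_per_block` int4 elements.
__global__ void tma_copy_sketch(const int4* __restrict__ src,
                                int4* __restrict__ dst,
                                int num_int4_per_block) {
    extern __shared__ int4 smem_buffer[];
    __shared__ cuda::barrier<cuda::thread_scope_block> barrier;
    if (threadIdx.x == 0)
        init(&barrier, blockDim.x);
    __syncthreads();

    // One bulk asynchronous copy for the whole block instead of per-thread LD/ST.
    auto block = cg::this_thread_block();
    cuda::memcpy_async(block, smem_buffer,
                       src + static_cast<size_t>(blockIdx.x) * num_int4_per_block,
                       cuda::aligned_size_t<16>(sizeof(int4) * num_int4_per_block),
                       barrier);
    barrier.arrive_and_wait();

    // Consume the tile (here: a plain copy-out just to keep the sketch complete).
    for (int i = threadIdx.x; i < num_int4_per_block; i += blockDim.x)
        dst[static_cast<size_t>(blockIdx.x) * num_int4_per_block + i] = smem_buffer[i];
}
```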
Shangyan Zhou
df4debe30c
Reduce NVSHMEM gpu memory usage and disable MNNVL. ( #190 )
...
Co-authored-by: Shangyan Zhou <sy.zhou@deepseek.com>
2025-06-06 13:25:43 +08:00
Chenggang Zhao
d8dd185c68
Update README
2025-06-05 14:41:51 +08:00
Shangyan Zhou
de8cfca3cf
Update readme.
2025-06-05 09:59:58 +08:00
Shangyan Zhou
fc48a467a7
Merge pull request #182 from wzc-wuzhicheng/fix-notify-dispatch
...
Fix notify_dispatch: using warp 0 to issue send
2025-06-05 09:29:16 +08:00
wzc.wuzhicheng
d0225df27d
Fix notify_dispatch: using warp 0 to issue send
...
Signed-off-by: wzc.wuzhicheng <wzc.wuzhicheng@linux.alibaba.com>
2025-06-03 20:20:02 +08:00
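A minimal sketch of the pattern behind this fix, with placeholder names rather than the actual notify_dispatch code: only warp 0 issues the sends, and the remaining warps wait at a barrier so the operation is not duplicated.
```cuda
// Sketch only: restrict send issuance to warp 0 so each destination rank is
// notified exactly once; `issue_send_to_rank` is a placeholder, not a real API.
__global__ void notify_dispatch_sketch(int num_rdma_ranks) {
    const int warp_id = static_cast<int>(threadIdx.x / 32);
    const int lane_id = static_cast<int>(threadIdx.x % 32);

    if (warp_id == 0) {
        // One lane per destination rank.
        for (int dst = lane_id; dst < num_rdma_ranks; dst += 32) {
            // issue_send_to_rank(dst);
        }
    }
    __syncthreads();  // other warps continue only after the sends are issued
}
```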
Shangyan Zhou
9fe9021f29
Use IBGDA only ( #177 )
2025-05-28 16:40:14 +08:00
Chenggang Zhao
aae9fa9a6d
Allow NVLink traffic for low-latency kernels by default
2025-05-23 20:14:50 +08:00
Shangyan Zhou
8da1b1f81e
Merge pull request #174 from deepseek-ai/p2p-refactor
...
Low-latency P2P code cleanup and bug fixes
2025-05-23 11:25:38 +08:00
Chenggang Zhao
92405ddf30
Code cleanup and bug fixes
2025-05-23 11:14:16 +08:00
cywork121
68ae8b3d07
Feature: LL nvlink p2p ( #173 )
2025-05-23 10:37:45 +08:00
guyueh1
d5ca4495c0
Make TORCH_CUDA_ARCH_LIST an environment variable ( #167 )
...
* Add 10.0 to TORCH_CUDA_ARCH_LIST
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
* Revert csrc/CMakeLists change; in setup.py make TORCH_CUDA_ARCH_LIST configurable
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
---------
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
2025-05-19 09:43:48 +08:00
Chenggang Zhao
bb393e7760
Merge pull request #154 from sleepcoo/support-more-hidden
...
Support hidden size 4096
2025-05-12 16:55:09 +08:00
sleepcoo
a107266a4e
support hidden size 4096
...
Co-authored-by: zhyncs <me@zhyncs.com>
Co-authored-by: yinfan98 <1106310035@qq.com>
2025-05-12 16:41:21 +08:00
Shangyan Zhou
05104029fd
Merge pull request #151 from vicoooo26/feat/nvidia-peer-mem-detection
...
Feat: enhance nvidia peer memory detection
2025-05-12 09:32:47 +08:00
Chenggang Zhao
f0a9f10629
Merge pull request #153 from wangfakang/opt-shuffled_dst
...
Shuffle the starting index of the target rank for different ranks and channels
2025-05-12 09:25:21 +08:00
wangfakang
63c29d06a0
To mitigate incast congestion, shuffle the starting index of the target rank across different ranks and channels
...
Signed-off-by: wangfakang <fakangwang@gmail.com>
2025-05-10 09:55:35 +08:00
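A minimal sketch of the shuffling idea, with illustrative names rather than DeepEP's actual send loop: offsetting each sender's starting destination by its rank and channel means that, at any given step, different senders target different receivers, reducing incast pressure on any single receiver.
```cuda
// Sketch only: shuffled destination-rank order per (rank, channel).
__device__ __forceinline__ int shuffled_dst_rank(int step, int rank,
                                                 int channel_id, int num_ranks) {
    return (step + rank + channel_id) % num_ranks;
}

__global__ void send_loop_sketch(int rank, int num_ranks) {
    const int channel_id = static_cast<int>(blockIdx.x);
    for (int step = 0; step < num_ranks; ++ step) {
        const int dst_rank = shuffled_dst_rank(step, rank, channel_id, num_ranks);
        // ... issue this channel's send toward dst_rank here ...
        (void) dst_rank;
    }
}
```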
Vico Chu
c6051f3880
Feat: enhance nvidia peer memory detection
2025-05-09 17:12:07 +08:00
Chenggang Zhao
9056a6db95
Merge pull request #142 from fzyzcjy/patch-3
...
Fix DeepEP cannot be used together with code that needs GIL such as Mooncake transfer engine
2025-05-08 16:04:54 +08:00
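A minimal sketch of the general technique behind this fix, assuming a hypothetical binding function rather than DeepEP's actual deep_ep.cpp code: release the GIL around blocking C++/CUDA work inside the pybind11 binding so Python threads owned by other components (e.g. a transfer engine) are not starved. This is host-side code as it would appear in a CUDA/C++ translation unit.
```cuda
#include <pybind11/pybind11.h>
namespace py = pybind11;

// Sketch only, not DeepEP's exact code: `blocking_dispatch_wrapper` is a
// placeholder for a binding that blocks on kernels, streams, or events.
void blocking_dispatch_wrapper(/* tensor arguments elided */) {
    py::gil_scoped_release release;  // drop the GIL while this thread blocks
    // ... launch kernels, synchronize streams, or wait on events here ...
}   // the GIL is re-acquired automatically when `release` is destroyed
```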
fzyzcjy
adc6e24cb0
Update deep_ep.cpp
2025-05-08 16:01:47 +08:00
fzyzcjy
23ded3bd8d
Update deep_ep.cpp
2025-04-29 09:58:31 +08:00
Shangyan Zhou
65e2a700f0
Merge pull request #135 from deepseek-ai/add-iw-fork
...
Add Infrawaves' fork to README.
2025-04-27 10:51:18 +08:00
Shangyan Zhou
1a0c8f6425
Add Infrawaves' fork to README.
2025-04-27 10:37:30 +08:00
Chenggang Zhao
007fcfcf97
Merge pull request #130 from deepseek-ai/trmt/internode_multi_qp
...
Support multi-QP for normal kernels
2025-04-22 13:04:42 +08:00
Shangyan Zhou
e255d57bef
Use put_nbi_warp.
2025-04-22 12:29:46 +08:00
Shangyan Zhou
3b1045db43
Fix the performance data.
2025-04-22 11:23:42 +08:00
Chenggang Zhao
edbb1bc3ff
Several code lints
2025-04-22 10:52:10 +08:00
Shangyan Zhou
3e54b78fd7
Normal kernels always use IBGDA mode.
2025-04-22 10:36:24 +08:00
Shangyan Zhou
20b2aaaf9e
Refactor some code.
2025-04-22 10:22:30 +08:00
moningchen
c07fdd197c
Merge branch 'trmt/internode_multi_qp' of github.com:deepseek-ai/DeepEP into trmt/internode_multi_qp
2025-04-21 21:31:49 +08:00
moningchen
e0eaaf94fb
Add the performance data after internode optimization in the Readme file
2025-04-21 21:30:08 +08:00
Shangyan Zhou
e2c578485c
Revert ibgda_device.cuh and remove some comments.
2025-04-21 17:44:32 +08:00
moningchen
5ab80c28f3
In the internode normal kernel, when NVSHMEM IBRC is used for RDMA data transmission, a single QP carries all data transfer between two GPUs, which limits kernel performance in dual-port NIC and RoCE network scenarios.
...
In our optimized internode normal kernel, we use multiple QPs for data transmission between two GPUs, assigning a different QP to each channel. Additionally, we change the transport from IBRC to IBGDA.
With these optimizations, the internode normal kernel achieves optimal performance in both H800 and H20 environments, with RDMA transmission performance nearly reaching the physical network limit. Using the current default statistical method, in 4-node H800 and H20 environments, RDMA performance can reach 60 GB/s+.
2025-04-21 15:50:39 +08:00
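A minimal sketch of the channel-to-QP mapping described above, with placeholder names rather than the actual DeepEP/NVSHMEM IBGDA API: giving each channel its own QP lets traffic spread across both NIC ports instead of serializing on a single QP.
```cuda
// Sketch only: pick a per-channel QP toward the destination rank.
// `num_qps_per_rank` and the send call are placeholders for illustration.
__device__ __forceinline__ int select_qp_id(int channel_id, int num_qps_per_rank) {
    return channel_id % num_qps_per_rank;
}

__global__ void multi_qp_send_sketch(int dst_rank, int num_qps_per_rank) {
    const int channel_id = static_cast<int>(blockIdx.x);
    const int qp_id = select_qp_id(channel_id, num_qps_per_rank);
    // ... issue this channel's RDMA write toward dst_rank on QP `qp_id` ...
    (void) dst_rank; (void) qp_id;
}
```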
Shangyan Zhou
a84a24808f
Merge pull request #124 from wplf/patch-1
...
Fix typo in nvshmem.patch
2025-04-16 10:57:31 +08:00
李金梁
a2ccc95d78
Fix typo in nvshmem.patch
2025-04-16 10:30:38 +08:00
Chenggang Zhao
a0c69317ab
Merge pull request #118 from andylin-hao/main
...
Fix test combine args
2025-04-14 15:51:30 +08:00
Shangyan Zhou
b9bb2bbaf6
Merge pull request #119 from phantom5125/patch-1
...
Fix typo in nvshmem.patch
2025-04-14 09:29:46 +08:00
GreatHato
42f617088f
Fix typo in nvshmem.patch
2025-04-13 00:14:44 +08:00
Hao Lin
23c54150ba
Fix test combine args
...
Signed-off-by: Hao Lin <linhaomails@gmail.com>
2025-04-11 18:21:09 +08:00
Chenggang Zhao
8a0ca8e2ec
Merge pull request #116 from alpha-baby/fix-test-result-not-output
...
fix: results not output on some Linux systems
2025-04-11 13:23:37 +08:00