DeepEP

mirror of https://github.com/deepseek-ai/DeepEP synced 2025-06-26 18:28:11 +00:00

Author	SHA1	Message	Date
Shangyan Zhou	85adc566e2	Add `get_comm_stream`. (#256) * Add `get_comm_stream`. * Fix style.	2025-06-25 18:12:29 +08:00
Shangyan Zhou	9eb2f84b3e	Optimize intranode combine. (#247) * Increase the test round. * Add warp synchronization. * Shuffle the send warps. * Add time elapsed into bench result.	2025-06-24 09:10:23 +08:00
fzyzcjy	fbcf430006	Update internode_ll.cu (#246)	2025-06-23 15:18:10 +08:00
fzyzcjy	c95997f8c4	Update deep_ep.cpp (#242)	2025-06-23 11:44:06 +08:00
Chenggang Zhao	7b0c25f864	Support more hidden size	2025-06-20 16:37:28 +08:00
Chenggang Zhao	9d4f7ef8ee	Surpass type checks	2025-06-18 16:04:42 +08:00
Chenggang Zhao	b56f7c2c8c	Adjust import order	2025-06-18 15:50:06 +08:00
Shangyan Zhou	a2d2354e1d	Merge pull request #222 from deepseek-ai/set_dev_id Set `device_id` to suppress pytorch warning.	2025-06-18 14:53:26 +08:00
Shangyan Zhou	cd371d31fc	Move import.	2025-06-18 14:52:04 +08:00
Shangyan Zhou	bf4a4a21d2	Set `device_id` to suppress pytorch warning.	2025-06-18 14:43:38 +08:00
Shangyan Zhou	77f97f79bd	Fix the tail loading issue. (#219) * Fix the tail loading issue. * Modify the sync offset.	2025-06-18 09:23:25 +08:00
Shangyan Zhou	dd133d39bc	Fix warp synchronization. (#215) * Fix warp synchronization. * Another fix.	2025-06-16 17:05:11 +08:00
Chenggang Zhao	8aaddf76ae	Remove the low-latency usage flag (#214)	2025-06-16 13:30:14 +08:00
Chenggang Zhao	1b92be8a71	Add automatic warp count control for low-latency kernels (#213) * Add automatic warp count control for low-latency dispatch * Add automatic warp count control for low-latency combine * More assertions	2025-06-16 11:56:43 +08:00
fzyzcjy	4e923188f7	Update intranode.cu (#210)	2025-06-16 11:03:58 +08:00
Shangyan Zhou	483f00af84	Update assertion of `num_rc_per_pe`.	2025-06-13 15:16:23 +08:00
Zhicheng Wu	05df5554ff	Use one qp per sm for internode normal kernels (#181) let the sender SM use the channel_id, and the receiver SM use channel_id + num_channels	2025-06-13 14:37:59 +08:00
Shifang Xu	21efbe9b48	Support UE8M0 data format. (#206)	2025-06-12 09:38:19 +08:00
Chenggang Zhao	9ec061204e	Use `pynvml` to detect NVLink connections (#205) * Use `pynvml` to detect NVLink connections * Add a TODO * Add shutdown * Fix comments	2025-06-11 17:29:00 +08:00
Chenggang Zhao	b8d90fb753	Support Ampere architecture (#204) * Update README * Update `setup.py` * Fix headers * Add `DISABLE_NVSHMEM` for APIs * Fix launch * Fix TMA settings * Fix TMA usages * Fix dlink * Separate layout kernels * Update version * Add `is_sm90_compiled` * Fix tests * Add NVLink connection checks * Update README * Fix tests * Add some comments * Minor fix * Minor fix * Fix bugs	2025-06-11 15:48:18 +08:00
Chenggang Zhao	dd13c7145c	Check the empty list	2025-06-11 11:14:30 +08:00
Chenggang Zhao	a8299ca7c2	Support CUDA graph for intranode normal kernels (#203)	2025-06-11 11:08:54 +08:00
Chenggang Zhao	8da2d7b38d	Fully remove barrier FIFO designs (#200) * Fully remove FIFO slots * Fully remove FIFO buffers * Minor fix styles * Fix some typos * Bugs fixed * Cleanup `ibgda_poll_cq`	2025-06-10 16:23:20 +08:00
Shangyan Zhou	a16af40531	Merge pull request #201 from youkaichao/no_gdrcopy remove the dependency of gdrcopy	2025-06-10 16:11:32 +08:00
youkaichao	b9b7ce348b	update readme Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-06-10 15:49:50 +08:00
youkaichao	97be5a3873	update the patch Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-06-10 15:39:44 +08:00
Chenggang Zhao	1157693c0c	Remove useless comments	2025-06-09 17:14:25 +08:00
Chenggang Zhao	5a2e37fa28	Support statistics tensor for low-latency kernels (#196)	2025-06-09 15:50:56 +08:00
Chenggang Zhao	0d1a855d81	Add low-latency kernel PCIe usage flag (#195) * Add low-latency kernel usage flag * Update comments	2025-06-09 14:37:13 +08:00
Chenggang Zhao	564e375234	Fix `< PTX ISA 8.6` compatibility (#194)	2025-06-09 10:48:42 +08:00
Shangyan Zhou	11a0b0e1a3	Merge pull request #193 from fzyzcjy/feat/fix_mnnvl Allow using MNNVL	2025-06-08 13:05:08 +08:00
fzyzcjy	4cd951700e	more	2025-06-07 21:39:00 +08:00
Chenggang Zhao	c8dceba110	Use TMA instead of LD/ST for intra-node normal kernels (#191) * Update CMake files * Use TMA instead of LD/ST for intranode dispatch * Use TMA instead of LD/ST for intranode combine * Adjust configs * Test default configs as well * More warps for combine * Add inter-thread fence * Enable more warps * Do not use TMA for senders * Update configs * Remove useless wait	2025-06-06 15:40:17 +08:00
Shangyan Zhou	df4debe30c	Reduce NVSHMEM gpu memory usage and disable MNNVL. (#190) Co-authored-by: Shangyan Zhou <sy.zhou@deepseek.com>	2025-06-06 13:25:43 +08:00
Chenggang Zhao	d8dd185c68	Update README	2025-06-05 14:41:51 +08:00
Shangyan Zhou	de8cfca3cf	Update readme.	2025-06-05 09:59:58 +08:00
Shangyan Zhou	fc48a467a7	Merge pull request #182 from wzc-wuzhicheng/fix-notify-dispatch Fix notify_dispatch: using warp 0 to issue send	2025-06-05 09:29:16 +08:00
wzc.wuzhicheng	d0225df27d	Fix notify_dispatch: using warp 0 to issue send Signed-off-by: wzc.wuzhicheng <wzc.wuzhicheng@linux.alibaba.com>	2025-06-03 20:20:02 +08:00
Shangyan Zhou	9fe9021f29	Use IBGDA only (#177)	2025-05-28 16:40:14 +08:00
Chenggang Zhao	aae9fa9a6d	Allow NVLink traffic for low-latency kernels by default	2025-05-23 20:14:50 +08:00
Shangyan Zhou	8da1b1f81e	Merge pull request #174 from deepseek-ai/p2p-refactor Low-latency P2P code cleanup and bug fixed	2025-05-23 11:25:38 +08:00
Chenggang Zhao	92405ddf30	Code cleanup and bug fixed	2025-05-23 11:14:16 +08:00
cywork121	68ae8b3d07	Feature: LL nvlink p2p (#173)	2025-05-23 10:37:45 +08:00
guyueh1	d5ca4495c0	Make `TORCH_CUDA_ARCH_LIST` as an environment variable (#167) * Add 10.0 to TORCH_CUDA_ARCH_LIST Signed-off-by: Guyue Huang <guyueh@nvidia.com> * Revert csrc/CMakeLists change; in setup.py make TORCH_CUDA_ARCH_LIST configurable Signed-off-by: Guyue Huang <guyueh@nvidia.com> --------- Signed-off-by: Guyue Huang <guyueh@nvidia.com>	2025-05-19 09:43:48 +08:00
Chenggang Zhao	bb393e7760	Merge pull request #154 from sleepcoo/support-more-hidden Support hidden size 4096	2025-05-12 16:55:09 +08:00
sleepcoo	a107266a4e	support hidden size 4096 Co-authored-by: zhyncs <me@zhyncs.com> Co-authored-by: yinfan98 <1106310035@qq.com>	2025-05-12 16:41:21 +08:00
Shangyan Zhou	05104029fd	Merge pull request #151 from vicoooo26/feat/nvidia-peer-mem-detection Feat: enhance nvidia peer memory detection	2025-05-12 09:32:47 +08:00
Chenggang Zhao	f0a9f10629	Merge pull request #153 from wangfakang/opt-shuffled_dst Shuffling the starting index of target rank for different ranks and channels	2025-05-12 09:25:21 +08:00
wangfakang	63c29d06a0	To mitigate incast congestion, shuffle the starting index of target rank for different ranks and channels Signed-off-by: wangfakang <fakangwang@gmail.com>	2025-05-10 09:55:35 +08:00
Vico Chu	c6051f3880	Feat: enhance nvidia peer memory detection	2025-05-09 17:12:07 +08:00

119 Commits