DeepEP

mirror of https://github.com/deepseek-ai/DeepEP synced 2025-06-26 18:28:11 +00:00

Author	SHA1	Message	Date
Chenggang Zhao	d7d13878e0	Add transaction windows	2025-06-24 10:12:23 +08:00
Chenggang Zhao	185ecf5c4a	Merge remote-tracking branch 'origin/main' into internode-tma # Conflicts: # csrc/kernels/configs.cuh # csrc/kernels/internode.cu	2025-06-24 09:29:07 +08:00
Chenggang Zhao	a15faa9ff0	Remove useless assertion	2025-06-24 09:21:35 +08:00
Chenggang Zhao	bc118b248a	Add the transaction window data structure for RDMA senders (#245) * Add draft * Add fast-debugging flags * Fix several bugs * Add sender timeout checks * Fix stuck * Fix bugs * Fix bugs	2025-06-24 09:12:40 +08:00
Shangyan Zhou	9eb2f84b3e	Optimize intranode combine. (#247) * Increase the test round. * Add warp synchronization. * Shuffle the send warps. * Add time elapsed into bench result.	2025-06-24 09:10:23 +08:00
fzyzcjy	fbcf430006	Update internode_ll.cu (#246)	2025-06-23 15:18:10 +08:00
fzyzcjy	c95997f8c4	Update deep_ep.cpp (#242)	2025-06-23 11:44:06 +08:00
Chenggang Zhao	7b0c25f864	Support more hidden size	2025-06-20 16:37:28 +08:00
Chenggang Zhao	a086ac5536	Use correct buffer pointers	2025-06-20 16:25:49 +08:00
Chenggang Zhao	782b40a8ff	Add `ENABLE_FAST_DEBUG`	2025-06-20 14:44:53 +08:00
Chenggang Zhao	47dd77ab5f	Add retired flag	2025-06-20 14:35:15 +08:00
Chenggang Zhao	74afd75df2	Fix bugs	2025-06-20 14:27:54 +08:00
Chenggang Zhao	371df2da52	Fix bugs	2025-06-20 13:44:49 +08:00
Chenggang Zhao	8da790e3f3	Fix the shifted buffer pointer	2025-06-20 11:31:57 +08:00
Chenggang Zhao	cd5c57fb2a	Fix compilation	2025-06-20 11:15:03 +08:00
Chenggang Zhao	49b9084268	Fix several bugs	2025-06-20 10:57:56 +08:00
Chenggang Zhao	177e491e92	Fix send heads	2025-06-19 18:05:59 +08:00
Chenggang Zhao	55bbd8caaf	Add impl	2025-06-19 17:15:43 +08:00
Chenggang Zhao	a0a6e22eff	Fully remove forwarders' and NVL receivers' code	2025-06-19 13:48:07 +08:00
Chenggang Zhao	3a3398f686	Minor fix	2025-06-19 10:38:42 +08:00
Chenggang Zhao	9d4f7ef8ee	Surpass type checks	2025-06-18 16:04:42 +08:00
Chenggang Zhao	b56f7c2c8c	Adjust import order	2025-06-18 15:50:06 +08:00
Shangyan Zhou	a2d2354e1d	Merge pull request #222 from deepseek-ai/set_dev_id Set `device_id` to suppress pytorch warning.	2025-06-18 14:53:26 +08:00
Shangyan Zhou	cd371d31fc	Move import.	2025-06-18 14:52:04 +08:00
Shangyan Zhou	bf4a4a21d2	Set `device_id` to suppress pytorch warning.	2025-06-18 14:43:38 +08:00
Chenggang Zhao	24453275e3	Add `EP_TEST_LL_COMPATIBILITY`	2025-06-18 10:59:44 +08:00
Shangyan Zhou	77f97f79bd	Fix the tail loading issue. (#219) * Fix the tail loading issue. * Modify the sync offset.	2025-06-18 09:23:25 +08:00
Shangyan Zhou	dd133d39bc	Fix warp synchronization. (#215) * Fix warp synchronization. * Another fix.	2025-06-16 17:05:11 +08:00
Chenggang Zhao	8aaddf76ae	Remove the low-latency usage flag (#214)	2025-06-16 13:30:14 +08:00
Chenggang Zhao	1b92be8a71	Add automatic warp count control for low-latency kernels (#213) * Add automatic warp count control for low-latency dispatch * Add automatic warp count control for low-latency combine * More assertions	2025-06-16 11:56:43 +08:00
fzyzcjy	4e923188f7	Update intranode.cu (#210)	2025-06-16 11:03:58 +08:00
Shangyan Zhou	483f00af84	Update assertion of `num_rc_per_pe`.	2025-06-13 15:16:23 +08:00
Zhicheng Wu	05df5554ff	Use one qp per sm for internode normal kernels (#181) let the sender SM use the channel_id, and the receiver SM use channel_id + num_channels	2025-06-13 14:37:59 +08:00
Shifang Xu	21efbe9b48	Support UE8M0 data format. (#206)	2025-06-12 09:38:19 +08:00
Chenggang Zhao	9ec061204e	Use `pynvml` to detect NVLink connections (#205) * Use `pynvml` to detect NVLink connections * Add a TODO * Add shutdown * Fix comments	2025-06-11 17:29:00 +08:00
Chenggang Zhao	b8d90fb753	Support Ampere architecture (#204) * Update README * Update `setup.py` * Fix headers * Add `DISABLE_NVSHMEM` for APIs * Fix launch * Fix TMA settings * Fix TMA usages * Fix dlink * Separate layout kernels * Update version * Add `is_sm90_compiled` * Fix tests * Add NVLink connection checks * Update README * Fix tests * Add some comments * Minor fix * Minor fix * Fix bugs	2025-06-11 15:48:18 +08:00
Chenggang Zhao	dd13c7145c	Check the empty list	2025-06-11 11:14:30 +08:00
Chenggang Zhao	a8299ca7c2	Support CUDA graph for intranode normal kernels (#203)	2025-06-11 11:08:54 +08:00
Chenggang Zhao	8da2d7b38d	Fully remove barrier FIFO designs (#200) * Fully remove FIFO slots * Fully remove FIFO buffers * Minor fix styles * Fix some typos * Bugs fixed * Cleanup `ibgda_poll_cq`	2025-06-10 16:23:20 +08:00
Shangyan Zhou	a16af40531	Merge pull request #201 from youkaichao/no_gdrcopy remove the dependency of gdrcopy	2025-06-10 16:11:32 +08:00
youkaichao	b9b7ce348b	update readme Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-06-10 15:49:50 +08:00
youkaichao	97be5a3873	update the patch Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-06-10 15:39:44 +08:00
Chenggang Zhao	1157693c0c	Remove useless comments	2025-06-09 17:14:25 +08:00
Chenggang Zhao	5a2e37fa28	Support statistics tensor for low-latency kernels (#196)	2025-06-09 15:50:56 +08:00
Chenggang Zhao	0d1a855d81	Add low-latency kernel PCIe usage flag (#195) * Add low-latency kernel usage flag * Update comments	2025-06-09 14:37:13 +08:00
Chenggang Zhao	564e375234	Fix `< PTX ISA 8.6` compatibility (#194)	2025-06-09 10:48:42 +08:00
Shangyan Zhou	11a0b0e1a3	Merge pull request #193 from fzyzcjy/feat/fix_mnnvl Allow using MNNVL	2025-06-08 13:05:08 +08:00
fzyzcjy	4cd951700e	more	2025-06-07 21:39:00 +08:00
Chenggang Zhao	c8dceba110	Use TMA instead of LD/ST for intra-node normal kernels (#191) * Update CMake files * Use TMA instead of LD/ST for intranode dispatch * Use TMA instead of LD/ST for intranode combine * Adjust configs * Test default configs as well * More warps for combine * Add inter-thread fence * Enable more warps * Do not use TMA for senders * Update configs * Remove useless wait	2025-06-06 15:40:17 +08:00
Shangyan Zhou	df4debe30c	Reduce NVSHMEM gpu memory usage and disable MNNVL. (#190) Co-authored-by: Shangyan Zhou <sy.zhou@deepseek.com>	2025-06-06 13:25:43 +08:00

135 Commits