Chenggang Zhao
74f4ef7b22
Remove the low-latency usage flag
2025-06-16 13:28:24 +08:00
Chenggang Zhao
1b92be8a71
Add automatic warp count control for low-latency kernels ( #213 )
...
* Add automatic warp count control for low-latency dispatch
* Add automatic warp count control for low-latency combine
* More assertions
2025-06-16 11:56:43 +08:00
fzyzcjy
4e923188f7
Update intranode.cu ( #210 )
2025-06-16 11:03:58 +08:00
Shangyan Zhou
483f00af84
Update assertion of `num_rc_per_pe`.
2025-06-13 15:16:23 +08:00
Zhicheng Wu
05df5554ff
Use one qp per sm for internode normal kernels ( #181 )
...
let the sender SM use the channel_id, and the receiver SM use channel_id + num_channels
2025-06-13 14:37:59 +08:00
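The index mapping described in the commit body above (sender SM on `channel_id`, receiver SM on `channel_id + num_channels`) can be sketched as follows. This is only an illustration of the arithmetic; the helper name `qp_index` and its signature are hypothetical and do not come from the repository.

```python
def qp_index(channel_id: int, num_channels: int, is_sender: bool) -> int:
    """Hypothetical helper: pick a QP index for one SM of a channel.

    Assumption: each channel has one sender SM and one receiver SM, so
    2 * num_channels QPs are enough to give every SM its own QP.
    """
    if not 0 <= channel_id < num_channels:
        raise ValueError("channel_id must be in [0, num_channels)")
    # Senders occupy [0, num_channels), receivers [num_channels, 2 * num_channels).
    return channel_id if is_sender else channel_id + num_channels
```

With this layout, no two SMs share a QP index, which is the property the commit title ("one qp per sm") asks for.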
Shifang Xu
21efbe9b48
Support UE8M0 data format. ( #206 )
2025-06-12 09:38:19 +08:00
Chenggang Zhao
9ec061204e
Use `pynvml` to detect NVLink connections ( #205 )
...
* Use `pynvml` to detect NVLink connections
* Add a TODO
* Add shutdown
* Fix comments
2025-06-11 17:29:00 +08:00
Chenggang Zhao
b8d90fb753
Support Ampere architecture ( #204 )
...
* Update README
* Update `setup.py`
* Fix headers
* Add `DISABLE_NVSHMEM` for APIs
* Fix launch
* Fix TMA settings
* Fix TMA usages
* Fix dlink
* Separate layout kernels
* Update version
* Add `is_sm90_compiled`
* Fix tests
* Add NVLink connection checks
* Update README
* Fix tests
* Add some comments
* Minor fix
* Minor fix
* Fix bugs
2025-06-11 15:48:18 +08:00
Chenggang Zhao
dd13c7145c
Check the empty list
2025-06-11 11:14:30 +08:00
Chenggang Zhao
a8299ca7c2
Support CUDA graph for intranode normal kernels ( #203 )
2025-06-11 11:08:54 +08:00
Chenggang Zhao
8da2d7b38d
Fully remove barrier FIFO designs ( #200 )
...
* Fully remove FIFO slots
* Fully remove FIFO buffers
* Minor fix styles
* Fix some typos
* Bugs fixed
* Cleanup `ibgda_poll_cq`
2025-06-10 16:23:20 +08:00
Shangyan Zhou
a16af40531
Merge pull request #201 from youkaichao/no_gdrcopy
...
remove the dependency of gdrcopy
2025-06-10 16:11:32 +08:00
youkaichao
b9b7ce348b
update readme
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-06-10 15:49:50 +08:00
youkaichao
97be5a3873
update the patch
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-06-10 15:39:44 +08:00
Chenggang Zhao
1157693c0c
Remove useless comments
2025-06-09 17:14:25 +08:00
Chenggang Zhao
5a2e37fa28
Support statistics tensor for low-latency kernels ( #196 )
2025-06-09 15:50:56 +08:00
Chenggang Zhao
0d1a855d81
Add low-latency kernel PCIe usage flag ( #195 )
...
* Add low-latency kernel usage flag
* Update comments
2025-06-09 14:37:13 +08:00
Chenggang Zhao
564e375234
Fix `< PTX ISA 8.6` compatibility ( #194 )
2025-06-09 10:48:42 +08:00
Shangyan Zhou
11a0b0e1a3
Merge pull request #193 from fzyzcjy/feat/fix_mnnvl
...
Allow using MNNVL
2025-06-08 13:05:08 +08:00
fzyzcjy
4cd951700e
more
2025-06-07 21:39:00 +08:00
Chenggang Zhao
c8dceba110
Use TMA instead of LD/ST for intra-node normal kernels ( #191 )
...
* Update CMake files
* Use TMA instead of LD/ST for intranode dispatch
* Use TMA instead of LD/ST for intranode combine
* Adjust configs
* Test default configs as well
* More warps for combine
* Add inter-thread fence
* Enable more warps
* Do not use TMA for senders
* Update configs
* Remove useless wait
2025-06-06 15:40:17 +08:00
Shangyan Zhou
df4debe30c
Reduce NVSHMEM gpu memory usage and disable MNNVL. ( #190 )
...
Co-authored-by: Shangyan Zhou <sy.zhou@deepseek.com>
2025-06-06 13:25:43 +08:00
Chenggang Zhao
d8dd185c68
Update README
2025-06-05 14:41:51 +08:00
Shangyan Zhou
de8cfca3cf
Update readme.
2025-06-05 09:59:58 +08:00
Shangyan Zhou
fc48a467a7
Merge pull request #182 from wzc-wuzhicheng/fix-notify-dispatch
...
Fix notify_dispatch: using warp 0 to issue send
2025-06-05 09:29:16 +08:00
wzc.wuzhicheng
d0225df27d
Fix notify_dispatch: using warp 0 to issue send
...
Signed-off-by: wzc.wuzhicheng <wzc.wuzhicheng@linux.alibaba.com>
2025-06-03 20:20:02 +08:00
Shangyan Zhou
9fe9021f29
Use IBGDA only ( #177 )
2025-05-28 16:40:14 +08:00
Chenggang Zhao
aae9fa9a6d
Allow NVLink traffic for low-latency kernels by default
2025-05-23 20:14:50 +08:00
Shangyan Zhou
8da1b1f81e
Merge pull request #174 from deepseek-ai/p2p-refactor
...
Low-latency P2P code cleanup and bug fixed
2025-05-23 11:25:38 +08:00
Chenggang Zhao
92405ddf30
Code cleanup and bug fixed
2025-05-23 11:14:16 +08:00
cywork121
68ae8b3d07
Feature: LL nvlink p2p ( #173 )
2025-05-23 10:37:45 +08:00
guyueh1
d5ca4495c0
Make `TORCH_CUDA_ARCH_LIST` as an environment variable ( #167 )
...
* Add 10.0 to TORCH_CUDA_ARCH_LIST
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
* Revert csrc/CMakeLists change; in setup.py make TORCH_CUDA_ARCH_LIST configurable
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
---------
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
2025-05-19 09:43:48 +08:00
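A minimal sketch of what "configurable in setup.py" might look like for the commit above: read `TORCH_CUDA_ARCH_LIST` from the environment and fall back to a default only when it is unset. The function name `resolve_arch_list` and the default value `"9.0"` are assumptions made for illustration, not taken from the repository.

```python
import os


def resolve_arch_list(default: str = "9.0") -> str:
    """Hypothetical helper: honor a user-provided TORCH_CUDA_ARCH_LIST.

    The default of "9.0" is an assumed placeholder; the real default
    lives in the project's setup.py.
    """
    # setdefault keeps an existing value and only writes the fallback when unset.
    return os.environ.setdefault("TORCH_CUDA_ARCH_LIST", default)
```

A user could then build for another architecture with, for example, `TORCH_CUDA_ARCH_LIST="10.0" python setup.py install`.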
Chenggang Zhao
bb393e7760
Merge pull request #154 from sleepcoo/support-more-hidden
...
Support hidden size 4096
2025-05-12 16:55:09 +08:00
sleepcoo
a107266a4e
support hidden size 4096
...
Co-authored-by: zhyncs <me@zhyncs.com>
Co-authored-by: yinfan98 <1106310035@qq.com>
2025-05-12 16:41:21 +08:00
Shangyan Zhou
05104029fd
Merge pull request #151 from vicoooo26/feat/nvidia-peer-mem-detection
...
Feat: enhance nvidia peer memory detection
2025-05-12 09:32:47 +08:00
Chenggang Zhao
f0a9f10629
Merge pull request #153 from wangfakang/opt-shuffled_dst
...
Shuffling the starting index of target rank for different ranks and channels
2025-05-12 09:25:21 +08:00
wangfakang
63c29d06a0
To mitigate incast congestion, shuffle the starting index of target rank for different ranks and channels
...
Signed-off-by: wangfakang <fakangwang@gmail.com>
2025-05-10 09:55:35 +08:00
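The idea in the commit above, staggering where each rank and channel begins its walk over target ranks so that they do not all hit the same destination first, can be sketched like this. The offset formula `(rank + channel_id) % num_ranks` is an assumption chosen for illustration; the commit's actual formula is not shown in this log.

```python
def target_rank_order(rank: int, channel_id: int, num_ranks: int) -> list[int]:
    """Hypothetical sketch: rotate the visiting order of target ranks.

    Assumption: the starting index depends on both the local rank and
    the channel, so concurrent senders begin at different destinations.
    """
    start = (rank + channel_id) % num_ranks
    # Every target is still visited exactly once; only the order is rotated.
    return [(start + i) % num_ranks for i in range(num_ranks)]
```

Without the rotation, every sender would begin with target 0, which concentrates traffic on one receiver at a time (incast).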
Vico Chu
c6051f3880
Feat: enhance nvidia peer memory detection
2025-05-09 17:12:07 +08:00
Chenggang Zhao
9056a6db95
Merge pull request #142 from fzyzcjy/patch-3
...
Fix DeepEP cannot be used together with code that needs GIL such as Mooncake transfer engine
2025-05-08 16:04:54 +08:00
fzyzcjy
adc6e24cb0
Update deep_ep.cpp
2025-05-08 16:01:47 +08:00
fzyzcjy
23ded3bd8d
Update deep_ep.cpp
2025-04-29 09:58:31 +08:00
Shangyan Zhou
65e2a700f0
Merge pull request #135 from deepseek-ai/add-iw-fork
...
Add Infrawaves' fork to README.
2025-04-27 10:51:18 +08:00
Shangyan Zhou
1a0c8f6425
Add Infrawaves' fork to README.
2025-04-27 10:37:30 +08:00
Chenggang Zhao
007fcfcf97
Merge pull request #130 from deepseek-ai/trmt/internode_multi_qp
...
Support multi-QP for normal kernels
2025-04-22 13:04:42 +08:00
Shangyan Zhou
e255d57bef
Use `put_nbi_warp`.
2025-04-22 12:29:46 +08:00
Shangyan Zhou
3b1045db43
Fix the performance data.
2025-04-22 11:23:42 +08:00
Chenggang Zhao
edbb1bc3ff
Several code lints
2025-04-22 10:52:10 +08:00
Shangyan Zhou
3e54b78fd7
Normal kernels always use IBGDA mode.
2025-04-22 10:36:24 +08:00
Shangyan Zhou
20b2aaaf9e
Refactor some code.
2025-04-22 10:22:30 +08:00
moningchen
c07fdd197c
Merge branch 'trmt/internode_multi_qp' of github.com:deepseek-ai/DeepEP into trmt/internode_multi_qp
2025-04-21 21:31:49 +08:00