Chenggang Zhao
8da2d7b38d
Fully remove barrier FIFO designs ( #200 )
...
* Fully remove FIFO slots
* Fully remove FIFO buffers
* Minor style fixes
* Fix some typos
* Bugs fixed
* Cleanup `ibgda_poll_cq`
2025-06-10 16:23:20 +08:00
Shangyan Zhou
a16af40531
Merge pull request #201 from youkaichao/no_gdrcopy
...
Remove the dependency on gdrcopy
2025-06-10 16:11:32 +08:00
youkaichao
b9b7ce348b
update readme
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-06-10 15:49:50 +08:00
youkaichao
97be5a3873
update the patch
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-06-10 15:39:44 +08:00
Chenggang Zhao
1157693c0c
Remove useless comments
2025-06-09 17:14:25 +08:00
Chenggang Zhao
5a2e37fa28
Support statistics tensor for low-latency kernels ( #196 )
2025-06-09 15:50:56 +08:00
Chenggang Zhao
0d1a855d81
Add low-latency kernel PCIe usage flag ( #195 )
...
* Add low-latency kernel usage flag
* Update comments
2025-06-09 14:37:13 +08:00
Chenggang Zhao
564e375234
Fix < PTX ISA 8.6 compatibility ( #194 )
2025-06-09 10:48:42 +08:00
Shangyan Zhou
11a0b0e1a3
Merge pull request #193 from fzyzcjy/feat/fix_mnnvl
...
Allow using MNNVL
2025-06-08 13:05:08 +08:00
fzyzcjy
4cd951700e
more
2025-06-07 21:39:00 +08:00
Chenggang Zhao
c8dceba110
Use TMA instead of LD/ST for intra-node normal kernels ( #191 )
...
* Update CMake files
* Use TMA instead of LD/ST for intranode dispatch
* Use TMA instead of LD/ST for intranode combine
* Adjust configs
* Test default configs as well
* More warps for combine
* Add inter-thread fence
* Enable more warps
* Do not use TMA for senders
* Update configs
* Remove useless wait
2025-06-06 15:40:17 +08:00
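A minimal sketch of the TMA-based copy idea behind commit c8dceba110 above, assuming hypothetical tile sizes and buffer names rather than DeepEP's actual kernels: on Hopper, `cuda::memcpy_async` with a block-scoped barrier can lower to the bulk-copy (TMA) path when addresses and sizes are 16-byte aligned, replacing per-thread LD/ST copies.
```cuda
#include <cuda/barrier>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Suppress the expected warning for the documented shared-memory barrier pattern.
#pragma nv_diag_suppress static_var_with_dynamic_init

// Sketch only: copy one tile per block from global to shared memory with a
// single bulk asynchronous copy, then consume it. Launch with enough dynamic
// shared memory for `num_int4_per_block` int4 elements.
__global__ void tma_copy_sketch(const int4* __restrict__ src,
                                int4* __restrict__ dst,
                                int num_int4_per_block) {
    extern __shared__ int4 smem_buffer[];
    __shared__ cuda::barrier<cuda::thread_scope_block> barrier;
    if (threadIdx.x == 0)
        init(&barrier, blockDim.x);
    __syncthreads();

    // One bulk asynchronous copy for the whole block instead of per-thread LD/ST.
    auto block = cg::this_thread_block();
    cuda::memcpy_async(block, smem_buffer,
                       src + static_cast<size_t>(blockIdx.x) * num_int4_per_block,
                       cuda::aligned_size_t<16>(sizeof(int4) * num_int4_per_block),
                       barrier);
    barrier.arrive_and_wait();

    // Consume the tile (here: a plain copy-out just to keep the sketch complete).
    for (int i = threadIdx.x; i < num_int4_per_block; i += blockDim.x)
        dst[static_cast<size_t>(blockIdx.x) * num_int4_per_block + i] = smem_buffer[i];
}
```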
Shangyan Zhou
df4debe30c
Reduce NVSHMEM gpu memory usage and disable MNNVL. ( #190 )
...
Co-authored-by: Shangyan Zhou <sy.zhou@deepseek.com>
2025-06-06 13:25:43 +08:00
Chenggang Zhao
d8dd185c68
Update README
2025-06-05 14:41:51 +08:00
Shangyan Zhou
de8cfca3cf
Update readme.
2025-06-05 09:59:58 +08:00
Shangyan Zhou
fc48a467a7
Merge pull request #182 from wzc-wuzhicheng/fix-notify-dispatch
...
Fix notify_dispatch: using warp 0 to issue send
2025-06-05 09:29:16 +08:00
wzc.wuzhicheng
d0225df27d
Fix notify_dispatch: using warp 0 to issue send
...
Signed-off-by: wzc.wuzhicheng <wzc.wuzhicheng@linux.alibaba.com>
2025-06-03 20:20:02 +08:00
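A minimal sketch of the pattern behind this fix, with placeholder names rather than the actual notify_dispatch code: only warp 0 issues the sends, and the remaining warps wait at a barrier so the operation is not duplicated.
```cuda
// Sketch only: restrict send issuance to warp 0 so each destination rank is
// notified exactly once; `issue_send_to_rank` is a placeholder, not a real API.
__global__ void notify_dispatch_sketch(int num_rdma_ranks) {
    const int warp_id = static_cast<int>(threadIdx.x / 32);
    const int lane_id = static_cast<int>(threadIdx.x % 32);

    if (warp_id == 0) {
        // One lane per destination rank.
        for (int dst = lane_id; dst < num_rdma_ranks; dst += 32) {
            // issue_send_to_rank(dst);
        }
    }
    __syncthreads();  // other warps continue only after the sends are issued
}
```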
Shangyan Zhou
9fe9021f29
Use IBGDA only ( #177 )
2025-05-28 16:40:14 +08:00
Chenggang Zhao
aae9fa9a6d
Allow NVLink traffic for low-latency kernels by default
2025-05-23 20:14:50 +08:00
Shangyan Zhou
8da1b1f81e
Merge pull request #174 from deepseek-ai/p2p-refactor
...
Low-latency P2P code cleanup and bug fixes
2025-05-23 11:25:38 +08:00
Chenggang Zhao
92405ddf30
Code cleanup and bug fixes
2025-05-23 11:14:16 +08:00
cywork121
68ae8b3d07
Feature: LL nvlink p2p ( #173 )
2025-05-23 10:37:45 +08:00
guyueh1
d5ca4495c0
Make TORCH_CUDA_ARCH_LIST an environment variable ( #167 )
...
* Add 10.0 to TORCH_CUDA_ARCH_LIST
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
* Revert csrc/CMakeLists change; in setup.py make TORCH_CUDA_ARCH_LIST configurable
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
---------
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
2025-05-19 09:43:48 +08:00
Chenggang Zhao
bb393e7760
Merge pull request #154 from sleepcoo/support-more-hidden
...
Support hidden size 4096
2025-05-12 16:55:09 +08:00
sleepcoo
a107266a4e
support hidden size 4096
...
Co-authored-by: zhyncs <me@zhyncs.com>
Co-authored-by: yinfan98 <1106310035@qq.com>
2025-05-12 16:41:21 +08:00
Shangyan Zhou
05104029fd
Merge pull request #151 from vicoooo26/feat/nvidia-peer-mem-detection
...
Feat: enhance nvidia peer memory detection
2025-05-12 09:32:47 +08:00
Chenggang Zhao
f0a9f10629
Merge pull request #153 from wangfakang/opt-shuffled_dst
...
Shuffle the starting index of the target rank for different ranks and channels
2025-05-12 09:25:21 +08:00
wangfakang
63c29d06a0
To mitigate incast congestion, shuffle the starting index of the target rank across different ranks and channels
...
Signed-off-by: wangfakang <fakangwang@gmail.com>
2025-05-10 09:55:35 +08:00
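A minimal sketch of the shuffling idea, with illustrative names rather than DeepEP's actual send loop: offsetting each sender's starting destination by its rank and channel means that, at any given step, different senders target different receivers, reducing incast pressure on any single receiver.
```cuda
// Sketch only: shuffled destination-rank order per (rank, channel).
__device__ __forceinline__ int shuffled_dst_rank(int step, int rank,
                                                 int channel_id, int num_ranks) {
    return (step + rank + channel_id) % num_ranks;
}

__global__ void send_loop_sketch(int rank, int num_ranks) {
    const int channel_id = static_cast<int>(blockIdx.x);
    for (int step = 0; step < num_ranks; ++ step) {
        const int dst_rank = shuffled_dst_rank(step, rank, channel_id, num_ranks);
        // ... issue this channel's send toward dst_rank here ...
        (void) dst_rank;
    }
}
```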
Vico Chu
c6051f3880
Feat: enhance nvidia peer memory detection
2025-05-09 17:12:07 +08:00
Chenggang Zhao
9056a6db95
Merge pull request #142 from fzyzcjy/patch-3
...
Fix DeepEP cannot be used together with code that needs GIL such as Mooncake transfer engine
2025-05-08 16:04:54 +08:00
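A minimal sketch of the general technique behind this fix, assuming a hypothetical binding function rather than DeepEP's actual deep_ep.cpp code: release the GIL around blocking C++/CUDA work inside the pybind11 binding so Python threads owned by other components (e.g. a transfer engine) are not starved. This is host-side code as it would appear in a CUDA/C++ translation unit.
```cuda
#include <pybind11/pybind11.h>
namespace py = pybind11;

// Sketch only, not DeepEP's exact code: `blocking_dispatch_wrapper` is a
// placeholder for a binding that blocks on kernels, streams, or events.
void blocking_dispatch_wrapper(/* tensor arguments elided */) {
    py::gil_scoped_release release;  // drop the GIL while this thread blocks
    // ... launch kernels, synchronize streams, or wait on events here ...
}   // the GIL is re-acquired automatically when `release` is destroyed
```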
fzyzcjy
adc6e24cb0
Update deep_ep.cpp
2025-05-08 16:01:47 +08:00
fzyzcjy
23ded3bd8d
Update deep_ep.cpp
2025-04-29 09:58:31 +08:00
Shangyan Zhou
65e2a700f0
Merge pull request #135 from deepseek-ai/add-iw-fork
...
Add Infrawaves' fork to README.
2025-04-27 10:51:18 +08:00
Shangyan Zhou
1a0c8f6425
Add Infrawaves' fork to README.
2025-04-27 10:37:30 +08:00
Chenggang Zhao
007fcfcf97
Merge pull request #130 from deepseek-ai/trmt/internode_multi_qp
...
Support multi-QP for normal kernels
2025-04-22 13:04:42 +08:00
Shangyan Zhou
e255d57bef
Use put_nbi_warp.
2025-04-22 12:29:46 +08:00
Shangyan Zhou
3b1045db43
Fix the performance data.
2025-04-22 11:23:42 +08:00
Chenggang Zhao
edbb1bc3ff
Several code lints
2025-04-22 10:52:10 +08:00
Shangyan Zhou
3e54b78fd7
Normal kernels always use IBGDA mode.
2025-04-22 10:36:24 +08:00
Shangyan Zhou
20b2aaaf9e
Refactor some code.
2025-04-22 10:22:30 +08:00
moningchen
c07fdd197c
Merge branch 'trmt/internode_multi_qp' of github.com:deepseek-ai/DeepEP into trmt/internode_multi_qp
2025-04-21 21:31:49 +08:00
moningchen
e0eaaf94fb
Add the performance data after internode optimization in the Readme file
2025-04-21 21:30:08 +08:00
Shangyan Zhou
e2c578485c
Revert ibgda_device.cuh and remove some comments.
2025-04-21 17:44:32 +08:00
moningchen
5ab80c28f3
In the internode normal kernel, when NVSHMEM IBRC is used for RDMA data transmission, a single QP carries all data transfer between two GPUs, which limits kernel performance in dual-port NIC and RoCE network scenarios.
...
In our optimized internode normal kernel, we use multiple QPs for data transmission between two GPUs, assigning a different QP to each channel. Additionally, we change the transport from IBRC to IBGDA.
With these optimizations, the internode normal kernel achieves optimal performance in both H800 and H20 environments, with RDMA transmission performance nearly reaching the physical network limit. Using the current default statistical method, in 4-node H800 and H20 environments, RDMA performance can reach 60 GB/s+.
2025-04-21 15:50:39 +08:00
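A minimal sketch of the channel-to-QP mapping described above, with placeholder names rather than the actual DeepEP/NVSHMEM IBGDA API: giving each channel its own QP lets traffic spread across both NIC ports instead of serializing on a single QP.
```cuda
// Sketch only: pick a per-channel QP toward the destination rank.
// `num_qps_per_rank` and the send call are placeholders for illustration.
__device__ __forceinline__ int select_qp_id(int channel_id, int num_qps_per_rank) {
    return channel_id % num_qps_per_rank;
}

__global__ void multi_qp_send_sketch(int dst_rank, int num_qps_per_rank) {
    const int channel_id = static_cast<int>(blockIdx.x);
    const int qp_id = select_qp_id(channel_id, num_qps_per_rank);
    // ... issue this channel's RDMA write toward dst_rank on QP `qp_id` ...
    (void) dst_rank; (void) qp_id;
}
```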
Shangyan Zhou
a84a24808f
Merge pull request #124 from wplf/patch-1
...
Fix typo in nvshmem.patch
2025-04-16 10:57:31 +08:00
李金梁
a2ccc95d78
Fix typo in nvshmem.patch
2025-04-16 10:30:38 +08:00
Chenggang Zhao
a0c69317ab
Merge pull request #118 from andylin-hao/main
...
Fix test combine args
2025-04-14 15:51:30 +08:00
Shangyan Zhou
b9bb2bbaf6
Merge pull request #119 from phantom5125/patch-1
...
Fix typo in nvshmem.patch
2025-04-14 09:29:46 +08:00
GreatHato
42f617088f
Fix typo in nvshmem.patch
2025-04-13 00:14:44 +08:00
Hao Lin
23c54150ba
Fix test combine args
...
Signed-off-by: Hao Lin <linhaomails@gmail.com>
2025-04-11 18:21:09 +08:00
Chenggang Zhao
8a0ca8e2ec
Merge pull request #116 from alpha-baby/fix-test-result-not-output
...
fix: results not output on some Linux systems
2025-04-11 13:23:37 +08:00