Commit Graph

97 Commits

Author SHA1 Message Date
Chenggang Zhao
8da2d7b38d
Fully remove barrier FIFO designs (#200)
* Fully remove FIFO slots

* Fully remove FIFO buffers

* Minor style fixes

* Fix some typos

* Bugs fixed

* Cleanup `ibgda_poll_cq`
2025-06-10 16:23:20 +08:00
Shangyan Zhou
a16af40531
Merge pull request #201 from youkaichao/no_gdrcopy
Remove the dependency on gdrcopy
2025-06-10 16:11:32 +08:00
youkaichao
b9b7ce348b update readme
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-06-10 15:49:50 +08:00
youkaichao
97be5a3873 update the patch
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-06-10 15:39:44 +08:00
Chenggang Zhao
1157693c0c Remove useless comments 2025-06-09 17:14:25 +08:00
Chenggang Zhao
5a2e37fa28
Support statistics tensor for low-latency kernels (#196) 2025-06-09 15:50:56 +08:00
Chenggang Zhao
0d1a855d81
Add low-latency kernel PCIe usage flag (#195)
* Add low-latency kernel usage flag

* Update comments
2025-06-09 14:37:13 +08:00
Chenggang Zhao
564e375234
Fix compatibility with PTX ISA < 8.6 (#194) 2025-06-09 10:48:42 +08:00
Shangyan Zhou
11a0b0e1a3
Merge pull request #193 from fzyzcjy/feat/fix_mnnvl
Allow using MNNVL
2025-06-08 13:05:08 +08:00
fzyzcjy
4cd951700e more 2025-06-07 21:39:00 +08:00
Chenggang Zhao
c8dceba110
Use TMA instead of LD/ST for intra-node normal kernels (#191)
* Update CMake files

* Use TMA instead of LD/ST for intranode dispatch

* Use TMA instead of LD/ST for intranode combine

* Adjust configs

* Test default configs as well

* More warps for combine

* Add inter-thread fence

* Enable more warps

* Do not use TMA for senders

* Update configs

* Remove useless wait
2025-06-06 15:40:17 +08:00
Shangyan Zhou
df4debe30c
Reduce NVSHMEM GPU memory usage and disable MNNVL. (#190)
Co-authored-by: Shangyan Zhou <sy.zhou@deepseek.com>
2025-06-06 13:25:43 +08:00
Chenggang Zhao
d8dd185c68 Update README 2025-06-05 14:41:51 +08:00
Shangyan Zhou
de8cfca3cf Update readme. 2025-06-05 09:59:58 +08:00
Shangyan Zhou
fc48a467a7
Merge pull request #182 from wzc-wuzhicheng/fix-notify-dispatch
Fix notify_dispatch: use warp 0 to issue the send
2025-06-05 09:29:16 +08:00
wzc.wuzhicheng
d0225df27d Fix notify_dispatch: use warp 0 to issue the send
Signed-off-by: wzc.wuzhicheng <wzc.wuzhicheng@linux.alibaba.com>
2025-06-03 20:20:02 +08:00
Shangyan Zhou
9fe9021f29
Use IBGDA only (#177) 2025-05-28 16:40:14 +08:00
Chenggang Zhao
aae9fa9a6d
Allow NVLink traffic for low-latency kernels by default 2025-05-23 20:14:50 +08:00
Shangyan Zhou
8da1b1f81e
Merge pull request #174 from deepseek-ai/p2p-refactor
Low-latency P2P code cleanup and bug fixes
2025-05-23 11:25:38 +08:00
Chenggang Zhao
92405ddf30 Code cleanup and bug fixes 2025-05-23 11:14:16 +08:00
cywork121
68ae8b3d07
Feature: LL nvlink p2p (#173) 2025-05-23 10:37:45 +08:00
guyueh1
d5ca4495c0
Make TORCH_CUDA_ARCH_LIST an environment variable (#167)
* Add 10.0 to TORCH_CUDA_ARCH_LIST

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

* Revert csrc/CMakeLists change; in setup.py make TORCH_CUDA_ARCH_LIST configurable

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

---------

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
2025-05-19 09:43:48 +08:00
Chenggang Zhao
bb393e7760
Merge pull request #154 from sleepcoo/support-more-hidden
Support hidden size 4096
2025-05-12 16:55:09 +08:00
sleepcoo
a107266a4e support hidden size 4096
Co-authored-by: zhyncs <me@zhyncs.com>
Co-authored-by: yinfan98 <1106310035@qq.com>
2025-05-12 16:41:21 +08:00
Shangyan Zhou
05104029fd
Merge pull request #151 from vicoooo26/feat/nvidia-peer-mem-detection
Feat: enhance NVIDIA peer memory detection
2025-05-12 09:32:47 +08:00
Chenggang Zhao
f0a9f10629
Merge pull request #153 from wangfakang/opt-shuffled_dst
Shuffle the starting index of the target rank across different ranks and channels
2025-05-12 09:25:21 +08:00
wangfakang
63c29d06a0 To mitigate incast congestion, shuffle the starting index of the target rank across different ranks and channels
Signed-off-by: wangfakang <fakangwang@gmail.com>
2025-05-10 09:55:35 +08:00
Vico Chu
c6051f3880 Feat: enhance NVIDIA peer memory detection 2025-05-09 17:12:07 +08:00
Chenggang Zhao
9056a6db95
Merge pull request #142 from fzyzcjy/patch-3
Fix: DeepEP cannot be used together with code that needs the GIL, such as the Mooncake transfer engine
2025-05-08 16:04:54 +08:00
fzyzcjy
adc6e24cb0
Update deep_ep.cpp 2025-05-08 16:01:47 +08:00
fzyzcjy
23ded3bd8d
Update deep_ep.cpp 2025-04-29 09:58:31 +08:00
Shangyan Zhou
65e2a700f0
Merge pull request #135 from deepseek-ai/add-iw-fork
Add Infrawaves' fork to README.
2025-04-27 10:51:18 +08:00
Shangyan Zhou
1a0c8f6425 Add Infrawaves' fork to README. 2025-04-27 10:37:30 +08:00
Chenggang Zhao
007fcfcf97
Merge pull request #130 from deepseek-ai/trmt/internode_multi_qp
Support multi-QP for normal kernels
2025-04-22 13:04:42 +08:00
Shangyan Zhou
e255d57bef Use put_nbi_warp. 2025-04-22 12:29:46 +08:00
Shangyan Zhou
3b1045db43 Fix the performance data. 2025-04-22 11:23:42 +08:00
Chenggang Zhao
edbb1bc3ff Several code lints 2025-04-22 10:52:10 +08:00
Shangyan Zhou
3e54b78fd7 Normal kernels always use IBGDA mode. 2025-04-22 10:36:24 +08:00
Shangyan Zhou
20b2aaaf9e Refactor some code. 2025-04-22 10:22:30 +08:00
moningchen
c07fdd197c Merge branch 'trmt/internode_multi_qp' of github.com:deepseek-ai/DeepEP into trmt/internode_multi_qp 2025-04-21 21:31:49 +08:00
moningchen
e0eaaf94fb Add the performance data after internode optimization in the Readme file 2025-04-21 21:30:08 +08:00
Shangyan Zhou
e2c578485c Revert ibgda_device.cuh and remove some comments. 2025-04-21 17:44:32 +08:00
moningchen
5ab80c28f3 In the Internode Normal Kernel, when using nvshmem ibrc for RDMA data transmission, a single QP is used for data transfer between two GPUs, which limits kernel performance in network card dual-port and RoCE network scenarios.
In our optimized Internode Normal Kernel, we implemented multiple QPs for data transmission between two GPUs, setting a different QP for each channel. Additionally, we modified the transmission method from IBRC to IBGAD.

Through these optimizations, the Internode Normal Kernel achieves optimal performance in both H800 and H20 environments, with RDMA transmission performance nearly reaching the physical network performance limit. Using the current default statistical method, in 4-node H800 and H20 environments, RDMA performance can reach 60GB/s+.
2025-04-21 15:50:39 +08:00
Shangyan Zhou
a84a24808f
Merge pull request #124 from wplf/patch-1
Fix typo in nvshmem.patch
2025-04-16 10:57:31 +08:00
李金梁
a2ccc95d78
Fix typo in nvshmem.patch 2025-04-16 10:30:38 +08:00
Chenggang Zhao
a0c69317ab
Merge pull request #118 from andylin-hao/main
Fix test combine args
2025-04-14 15:51:30 +08:00
Shangyan Zhou
b9bb2bbaf6
Merge pull request #119 from phantom5125/patch-1
Fix typo in nvshmem.patch
2025-04-14 09:29:46 +08:00
GreatHato
42f617088f
Fix typo in nvshmem.patch 2025-04-13 00:14:44 +08:00
Hao Lin
23c54150ba
Fix test combine args
Signed-off-by: Hao Lin <linhaomails@gmail.com>
2025-04-11 18:21:09 +08:00
Chenggang Zhao
8a0ca8e2ec
Merge pull request #116 from alpha-baby/fix-test-result-not-output
fix: test results not output on some Linux systems
2025-04-11 13:23:37 +08:00