Chenggang Zhao
d7d13878e0
Add transaction windows
2025-06-24 10:12:23 +08:00
Chenggang Zhao
185ecf5c4a
Merge remote-tracking branch 'origin/main' into internode-tma
...
# Conflicts:
# csrc/kernels/configs.cuh
# csrc/kernels/internode.cu
2025-06-24 09:29:07 +08:00
Chenggang Zhao
a15faa9ff0
Remove useless assertion
2025-06-24 09:21:35 +08:00
Chenggang Zhao
bc118b248a
Add the transaction window data structure for RDMA senders ( #245 )
...
* Add draft
* Add fast-debugging flags
* Fix several bugs
* Add sender timeout checks
* Fix stuck
* Fix bugs
* Fix bugs
2025-06-24 09:12:40 +08:00
Shangyan Zhou
9eb2f84b3e
Optimize intranode combine. ( #247 )
...
* Increase the test round.
* Add warp synchronization.
* Shuffle the send warps.
* Add time elapsed into bench result.
2025-06-24 09:10:23 +08:00
fzyzcjy
fbcf430006
Update internode_ll.cu ( #246 )
2025-06-23 15:18:10 +08:00
fzyzcjy
c95997f8c4
Update deep_ep.cpp ( #242 )
2025-06-23 11:44:06 +08:00
Chenggang Zhao
7b0c25f864
Support more hidden size
2025-06-20 16:37:28 +08:00
Chenggang Zhao
a086ac5536
Use correct buffer pointers
2025-06-20 16:25:49 +08:00
Chenggang Zhao
782b40a8ff
Add ENABLE_FAST_DEBUG
2025-06-20 14:44:53 +08:00
Chenggang Zhao
47dd77ab5f
Add retired flag
2025-06-20 14:35:15 +08:00
Chenggang Zhao
74afd75df2
Fix bugs
2025-06-20 14:27:54 +08:00
Chenggang Zhao
371df2da52
Fix bugs
2025-06-20 13:44:49 +08:00
Chenggang Zhao
8da790e3f3
Fix the shifted buffer pointer
2025-06-20 11:31:57 +08:00
Chenggang Zhao
cd5c57fb2a
Fix compilation
2025-06-20 11:15:03 +08:00
Chenggang Zhao
49b9084268
Fix several bugs
2025-06-20 10:57:56 +08:00
Chenggang Zhao
177e491e92
Fix send heads
2025-06-19 18:05:59 +08:00
Chenggang Zhao
55bbd8caaf
Add impl
2025-06-19 17:15:43 +08:00
Chenggang Zhao
a0a6e22eff
Fully remove forwarders' and NVL receivers' code
2025-06-19 13:48:07 +08:00
Chenggang Zhao
3a3398f686
Minor fix
2025-06-19 10:38:42 +08:00
Chenggang Zhao
9d4f7ef8ee
Surpass type checks
2025-06-18 16:04:42 +08:00
Chenggang Zhao
b56f7c2c8c
Adjust import order
2025-06-18 15:50:06 +08:00
Shangyan Zhou
a2d2354e1d
Merge pull request #222 from deepseek-ai/set_dev_id
...
Set `device_id` to suppress pytorch warning.
2025-06-18 14:53:26 +08:00
Shangyan Zhou
cd371d31fc
Move import.
2025-06-18 14:52:04 +08:00
Shangyan Zhou
bf4a4a21d2
Set `device_id` to suppress pytorch warning.
2025-06-18 14:43:38 +08:00
Chenggang Zhao
24453275e3
Add EP_TEST_LL_COMPATIBILITY
2025-06-18 10:59:44 +08:00
Shangyan Zhou
77f97f79bd
Fix the tail loading issue. ( #219 )
...
* Fix the tail loading issue.
* Modify the sync offset.
2025-06-18 09:23:25 +08:00
Shangyan Zhou
dd133d39bc
Fix warp synchronization. ( #215 )
...
* Fix warp synchronization.
* Another fix.
2025-06-16 17:05:11 +08:00
Chenggang Zhao
8aaddf76ae
Remove the low-latency usage flag ( #214 )
2025-06-16 13:30:14 +08:00
Chenggang Zhao
1b92be8a71
Add automatic warp count control for low-latency kernels ( #213 )
...
* Add automatic warp count control for low-latency dispatch
* Add automatic warp count control for low-latency combine
* More assertions
2025-06-16 11:56:43 +08:00
fzyzcjy
4e923188f7
Update intranode.cu ( #210 )
2025-06-16 11:03:58 +08:00
Shangyan Zhou
483f00af84
Update assertion of `num_rc_per_pe`.
2025-06-13 15:16:23 +08:00
Zhicheng Wu
05df5554ff
Use one qp per sm for internode normal kernels ( #181 )
...
let the sender SM use the channel_id, and the receiver SM use channel_id + num_channels
2025-06-13 14:37:59 +08:00
Shifang Xu
21efbe9b48
Support UE8M0 data format. ( #206 )
2025-06-12 09:38:19 +08:00
Chenggang Zhao
9ec061204e
Use `pynvml` to detect NVLink connections ( #205 )
...
* Use `pynvml` to detect NVLink connections
* Add a TODO
* Add shutdown
* Fix comments
2025-06-11 17:29:00 +08:00
Chenggang Zhao
b8d90fb753
Support Ampere architecture ( #204 )
...
* Update README
* Update `setup.py`
* Fix headers
* Add `DISABLE_NVSHMEM` for APIs
* Fix launch
* Fix TMA settings
* Fix TMA usages
* Fix dlink
* Separate layout kernels
* Update version
* Add `is_sm90_compiled`
* Fix tests
* Add NVLink connection checks
* Update README
* Fix tests
* Add some comments
* Minor fix
* Minor fix
* Fix bugs
2025-06-11 15:48:18 +08:00
Chenggang Zhao
dd13c7145c
Check the empty list
2025-06-11 11:14:30 +08:00
Chenggang Zhao
a8299ca7c2
Support CUDA graph for intranode normal kernels ( #203 )
2025-06-11 11:08:54 +08:00
Chenggang Zhao
8da2d7b38d
Fully remove barrier FIFO designs ( #200 )
...
* Fully remove FIFO slots
* Fully remove FIFO buffers
* Minor fix styles
* Fix some typos
* Bugs fixed
* Cleanup `ibgda_poll_cq`
2025-06-10 16:23:20 +08:00
Shangyan Zhou
a16af40531
Merge pull request #201 from youkaichao/no_gdrcopy
...
remove the dependency of gdrcopy
2025-06-10 16:11:32 +08:00
youkaichao
b9b7ce348b
update readme
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-06-10 15:49:50 +08:00
youkaichao
97be5a3873
update the patch
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-06-10 15:39:44 +08:00
Chenggang Zhao
1157693c0c
Remove useless comments
2025-06-09 17:14:25 +08:00
Chenggang Zhao
5a2e37fa28
Support statistics tensor for low-latency kernels ( #196 )
2025-06-09 15:50:56 +08:00
Chenggang Zhao
0d1a855d81
Add low-latency kernel PCIe usage flag ( #195 )
...
* Add low-latency kernel usage flag
* Update comments
2025-06-09 14:37:13 +08:00
Chenggang Zhao
564e375234
Fix `< PTX ISA 8.6` compatibility ( #194 )
2025-06-09 10:48:42 +08:00
Shangyan Zhou
11a0b0e1a3
Merge pull request #193 from fzyzcjy/feat/fix_mnnvl
...
Allow using MNNVL
2025-06-08 13:05:08 +08:00
fzyzcjy
4cd951700e
more
2025-06-07 21:39:00 +08:00
Chenggang Zhao
c8dceba110
Use TMA instead of LD/ST for intra-node normal kernels ( #191 )
...
* Update CMake files
* Use TMA instead of LD/ST for intranode dispatch
* Use TMA instead of LD/ST for intranode combine
* Adjust configs
* Test default configs as well
* More warps for combine
* Add inter-thread fence
* Enable more warps
* Do not use TMA for senders
* Update configs
* Remove useless wait
2025-06-06 15:40:17 +08:00
Shangyan Zhou
df4debe30c
Reduce NVSHMEM gpu memory usage and disable MNNVL. ( #190 )
...
Co-authored-by: Shangyan Zhou <sy.zhou@deepseek.com>
2025-06-06 13:25:43 +08:00