20 Commits

Author SHA1 Message Date
Shangyan Zhou
bd429ffefc
Support bias. (#257)
* Support bias.

* Fix.

* Fix style.
2025-06-25 13:04:20 +08:00
Shangyan Zhou
b80e55e21f
Add get_comm_stream. (#256)
* Add `get_comm_stream`.

* Fix style.
2025-06-25 13:02:13 +08:00
fzyzcjy
c95997f8c4
Update deep_ep.cpp (#242) 2025-06-23 11:44:06 +08:00
Chenggang Zhao
8aaddf76ae
Remove the low-latency usage flag (#214) 2025-06-16 13:30:14 +08:00
Chenggang Zhao
1b92be8a71
Add automatic warp count control for low-latency kernels (#213)
* Add automatic warp count control for low-latency dispatch

* Add automatic warp count control for low-latency combine

* More assertions
2025-06-16 11:56:43 +08:00
Shifang Xu
21efbe9b48
Support UE8M0 data format. (#206) 2025-06-12 09:38:19 +08:00
Chenggang Zhao
b8d90fb753
Support Ampere architecture (#204)
* Update README

* Update `setup.py`

* Fix headers

* Add `DISABLE_NVSHMEM` for APIs

* Fix launch

* Fix TMA settings

* Fix TMA usages

* Fix dlink

* Separate layout kernels

* Update version

* Add `is_sm90_compiled`

* Fix tests

* Add NVLink connection checks

* Update README

* Fix tests

* Add some comments

* Minor fix

* Minor fix

* Fix bugs
2025-06-11 15:48:18 +08:00
Chenggang Zhao
a8299ca7c2
Support CUDA graph for intranode normal kernels (#203) 2025-06-11 11:08:54 +08:00
Chenggang Zhao
8da2d7b38d
Fully remove barrier FIFO designs (#200)
* Fully remove FIFO slots

* Fully remove FIFO buffers

* Minor fix styles

* Fix some typos

* Bugs fixed

* Cleanup `ibgda_poll_cq`
2025-06-10 16:23:20 +08:00
Chenggang Zhao
1157693c0c
Remove useless comments 2025-06-09 17:14:25 +08:00
Chenggang Zhao
5a2e37fa28
Support statistics tensor for low-latency kernels (#196) 2025-06-09 15:50:56 +08:00
Chenggang Zhao
0d1a855d81
Add low-latency kernel PCIe usage flag (#195)
* Add low-latency kernel usage flag

* Update comments
2025-06-09 14:37:13 +08:00
Chenggang Zhao
92405ddf30
Code cleanup and bug fixed 2025-05-23 11:14:16 +08:00
fzyzcjy
adc6e24cb0
Update deep_ep.cpp 2025-05-08 16:01:47 +08:00
fzyzcjy
23ded3bd8d
Update deep_ep.cpp 2025-04-29 09:58:31 +08:00
Chenggang Zhao
dcaf73e5ff
Support zero-copy for low-latency combine 2025-03-18 15:41:50 +08:00
Dmytro Dzhulgakov
b3b61ef5ef
Allow passing output tensor in low_latency_combine 2025-03-10 22:19:21 +00:00
Chenggang Zhao
ed7487c15e
Support BF16 for low-latency kernels 2025-03-10 17:24:41 +08:00
Chenggang Zhao
6cc3497df8
Remove all raw tensors for better P2P overlapping 2025-03-03 14:25:22 +08:00
Chenggang Zhao
ebfe47e46f
Initial commit 2025-02-25 09:07:53 +08:00