14 Commits

Author SHA1 Message Date
Shangyan Zhou
4931324861 Support bias. (#257)
* Support bias.

* Fix.

* Fix style.
2025-06-25 18:12:33 +08:00
Shangyan Zhou
85adc566e2 Add get_comm_stream. (#256)
* Add `get_comm_stream`.

* Fix style.
2025-06-25 18:12:29 +08:00
Chenggang Zhao
8aaddf76ae Remove the low-latency usage flag (#214) 2025-06-16 13:30:14 +08:00
Chenggang Zhao
1b92be8a71 Add automatic warp count control for low-latency kernels (#213)
* Add automatic warp count control for low-latency dispatch

* Add automatic warp count control for low-latency combine

* More assertions
2025-06-16 11:56:43 +08:00
Shifang Xu
21efbe9b48 Support UE8M0 data format. (#206) 2025-06-12 09:38:19 +08:00
Chenggang Zhao
a8299ca7c2 Support CUDA graph for intranode normal kernels (#203) 2025-06-11 11:08:54 +08:00
Chenggang Zhao
8da2d7b38d Fully remove barrier FIFO designs (#200)
* Fully remove FIFO slots

* Fully remove FIFO buffers

* Minor fix styles

* Fix some typos

* Bugs fixed

* Cleanup `ibgda_poll_cq`
2025-06-10 16:23:20 +08:00
Chenggang Zhao
5a2e37fa28 Support statistics tensor for low-latency kernels (#196) 2025-06-09 15:50:56 +08:00
Chenggang Zhao
0d1a855d81 Add low-latency kernel PCIe usage flag (#195)
* Add low-latency kernel usage flag

* Update comments
2025-06-09 14:37:13 +08:00
Chenggang Zhao
92405ddf30 Code cleanup and bug fixed 2025-05-23 11:14:16 +08:00
Chenggang Zhao
dcaf73e5ff Support zero-copy for low-latency combine 2025-03-18 15:41:50 +08:00
Dmytro Dzhulgakov
b3b61ef5ef Allow passing output tensor in low_latency_combine 2025-03-10 22:19:21 +00:00
Chenggang Zhao
ed7487c15e Support BF16 for low-latency kernels 2025-03-10 17:24:41 +08:00
Chenggang Zhao
ebfe47e46f Initial commit 2025-02-25 09:07:53 +08:00