Commit Graph

11 Commits

Author SHA1 Message Date
Shangyan Zhou
4931324861
Support bias. (#257)
* Support bias.

* Fix.

* Fix style.
2025-06-25 18:12:33 +08:00
Shangyan Zhou
9eb2f84b3e
Optimize intranode combine. (#247)
* Increase the test round.

* Add warp synchronization.

* Shuffle the send warps.

* Add time elapsed into bench result.
2025-06-24 09:10:23 +08:00
fzyzcjy
4e923188f7
Update intranode.cu (#210)
2025-06-16 11:03:58 +08:00
Shifang Xu
21efbe9b48
Support UE8M0 data format. (#206)
2025-06-12 09:38:19 +08:00
Chenggang Zhao
b8d90fb753
Support Ampere architecture (#204)
* Update README

* Update `setup.py`

* Fix headers

* Add `DISABLE_NVSHMEM` for APIs

* Fix launch

* Fix TMA settings

* Fix TMA usages

* Fix dlink

* Separate layout kernels

* Update version

* Add `is_sm90_compiled`

* Fix tests

* Add NVLink connection checks

* Update README

* Fix tests

* Add some comments

* Minor fix

* Minor fix

* Fix bugs
2025-06-11 15:48:18 +08:00
Chenggang Zhao
a8299ca7c2
Support CUDA graph for intranode normal kernels (#203)
2025-06-11 11:08:54 +08:00
Chenggang Zhao
8da2d7b38d
Fully remove barrier FIFO designs (#200)
* Fully remove FIFO slots

* Fully remove FIFO buffers

* Minor fix styles

* Fix some typos

* Bugs fixed

* Cleanup `ibgda_poll_cq`
2025-06-10 16:23:20 +08:00
Chenggang Zhao
c8dceba110
Use TMA instead of LD/ST for intra-node normal kernels (#191)
* Update CMake files

* Use TMA instead of LD/ST for intranode dispatch

* Use TMA instead of LD/ST for intranode combine

* Adjust configs

* Test default configs as well

* More warps for combine

* Add inter-thread fence

* Enable more warps

* Do not use TMA for senders

* Update configs

* Remove useless wait
2025-06-06 15:40:17 +08:00
Chenggang Zhao
82dcf48fd3
Fix bugs for intranode EP kernels
2025-03-14 16:09:23 +08:00
Chenggang Zhao
1553fc42bf
Improve EP2/4 performance
2025-03-04 15:34:33 +08:00
Chenggang Zhao
ebfe47e46f
Initial commit
2025-02-25 09:07:53 +08:00