Shangyan Zhou
4931324861
Support bias. ( #257 )
* Support bias.
* Fix.
* Fix style.
2025-06-25 18:12:33 +08:00
Shangyan Zhou
9eb2f84b3e
Optimize intranode combine. ( #247 )
* Increase the test round.
* Add warp synchronization.
* Shuffle the send warps.
* Add time elapsed into bench result.
2025-06-24 09:10:23 +08:00
fzyzcjy
4e923188f7
Update intranode.cu ( #210 )
2025-06-16 11:03:58 +08:00
Shifang Xu
21efbe9b48
Support UE8M0 data format. ( #206 )
2025-06-12 09:38:19 +08:00
Chenggang Zhao
b8d90fb753
Support Ampere architecture ( #204 )
* Update README
* Update `setup.py`
* Fix headers
* Add `DISABLE_NVSHMEM` for APIs
* Fix launch
* Fix TMA settings
* Fix TMA usages
* Fix dlink
* Separate layout kernels
* Update version
* Add `is_sm90_compiled`
* Fix tests
* Add NVLink connection checks
* Update README
* Fix tests
* Add some comments
* Minor fix
* Minor fix
* Fix bugs
2025-06-11 15:48:18 +08:00
Chenggang Zhao
a8299ca7c2
Support CUDA graph for intranode normal kernels ( #203 )
2025-06-11 11:08:54 +08:00
Chenggang Zhao
8da2d7b38d
Fully remove barrier FIFO designs ( #200 )
* Fully remove FIFO slots
* Fully remove FIFO buffers
* Minor fix styles
* Fix some typos
* Bugs fixed
* Cleanup `ibgda_poll_cq`
2025-06-10 16:23:20 +08:00
Chenggang Zhao
c8dceba110
Use TMA instead of LD/ST for intra-node normal kernels ( #191 )
* Update CMake files
* Use TMA instead of LD/ST for intranode dispatch
* Use TMA instead of LD/ST for intranode combine
* Adjust configs
* Test default configs as well
* More warps for combine
* Add inter-thread fence
* Enable more warps
* Do not use TMA for senders
* Update configs
* Remove useless wait
2025-06-06 15:40:17 +08:00
Chenggang Zhao
82dcf48fd3
Fix bugs for intranode EP kernels
2025-03-14 16:09:23 +08:00
Chenggang Zhao
1553fc42bf
Improve EP2/4 performance
2025-03-04 15:34:33 +08:00
Chenggang Zhao
ebfe47e46f
Initial commit
2025-02-25 09:07:53 +08:00