Shifang Xu
21efbe9b48
Support UE8M0 data format. (#206)
2025-06-12 09:38:19 +08:00
Chenggang Zhao
b8d90fb753
Support Ampere architecture (#204)
* Update README
* Update `setup.py`
* Fix headers
* Add `DISABLE_NVSHMEM` for APIs
* Fix launch
* Fix TMA settings
* Fix TMA usages
* Fix dlink
* Separate layout kernels
* Update version
* Add `is_sm90_compiled`
* Fix tests
* Add NVLink connection checks
* Update README
* Fix tests
* Add some comments
* Minor fix
* Minor fix
* Fix bugs
2025-06-11 15:48:18 +08:00
Chenggang Zhao
dd13c7145c
Check the empty list
2025-06-11 11:14:30 +08:00
Chenggang Zhao
a8299ca7c2
Support CUDA graph for intranode normal kernels (#203)
2025-06-11 11:08:54 +08:00
Chenggang Zhao
c8dceba110
Use TMA instead of LD/ST for intra-node normal kernels (#191)
* Update CMake files
* Use TMA instead of LD/ST for intranode dispatch
* Use TMA instead of LD/ST for intranode combine
* Adjust configs
* Test default configs as well
* More warps for combine
* Add inter-thread fence
* Enable more warps
* Do not use TMA for senders
* Update configs
* Remove useless wait
2025-06-06 15:40:17 +08:00
Hao Lin
23c54150ba
Fix test combine args
Signed-off-by: Hao Lin <linhaomails@gmail.com>
2025-04-11 18:21:09 +08:00
fujianhao.fjh
0f80da8458
fix: no result output on some Linux systems
2025-04-10 18:18:30 +08:00
Chenggang Zhao
ae0eafd2be
Remove confusing comments
2025-03-25 09:27:34 +08:00
Chenggang Zhao
1553fc42bf
Improve EP2/4 performance
2025-03-04 15:34:33 +08:00
Chenggang Zhao
c5b4040502
Enable intranode kernel tests with EP2 and EP4
2025-03-03 15:01:02 +08:00
Chenggang Zhao
ebfe47e46f
Initial commit
2025-02-25 09:07:53 +08:00