Chenggang Zhao
|
1b92be8a71
|
Add automatic warp count control for low-latency kernels (#213)
* Add automatic warp count control for low-latency dispatch
* Add automatic warp count control for low-latency combine
* More assertions
|
2025-06-16 11:56:43 +08:00 |
|
Chenggang Zhao
|
b8d90fb753
|
Support Ampere architecture (#204)
* Update README
* Update `setup.py`
* Fix headers
* Add `DISABLE_NVSHMEM` for APIs
* Fix launch
* Fix TMA settings
* Fix TMA usages
* Fix dlink
* Separate layout kernels
* Update version
* Add `is_sm90_compiled`
* Fix tests
* Add NVLink connection checks
* Update README
* Fix tests
* Add some comments
* Minor fix
* Minor fix
* Fix bugs
|
2025-06-11 15:48:18 +08:00 |
|
Chenggang Zhao
|
42494864ba
|
Remove useless control metadata for low-latency combine
|
2025-04-07 09:55:39 +08:00 |
|
Chenggang Zhao
|
66465476ae
|
Support zero-copy for low-latency combine
|
2025-03-18 15:44:26 +08:00 |
|
Chenggang Zhao
|
dcaf73e5ff
|
Support zero-copy for low-latency combine
|
2025-03-18 15:41:50 +08:00 |
|
Chenggang Zhao
|
ed7487c15e
|
Support BF16 for low-latency kernels
|
2025-03-10 17:24:41 +08:00 |
|
Chenggang Zhao
|
1fc40d50f3
|
Improve AR performance
|
2025-03-06 21:41:19 +08:00 |
|
Chenggang Zhao
|
6cc3497df8
|
Remove all raw tensors for better P2P overlapping
|
2025-03-03 14:25:22 +08:00 |
|
Chenggang Zhao
|
ebfe47e46f
|
Initial commit
|
2025-02-25 09:07:53 +08:00 |
|