| Chenggang Zhao | 42494864ba | Remove useless control metadata for low-latency combine | 2025-04-07 09:55:39 +08:00 |
| Chenggang Zhao | c4d12b4f8f | Fix compilation | 2025-03-28 16:45:10 +08:00 |
| songhexiang | 4dd1e68ac8 | For the SMs which calculate metadata in notify_dispatch, each warp in the SM is used to calculate the metadata of one channel. The default configuration is 8 warps for 10 channels, which needs two rounds of loop. Maybe the number of warps can be configured to the number of the channels so that one loop is enough. | 2025-03-28 06:43:29 +00:00 |
| Chenggang Zhao | ffc39ba084 | Stronger acquire scope for low-latency kernels | 2025-03-27 09:30:36 +08:00 |
| Chenggang Zhao | dcaf73e5ff | Support zero-copy for low-latency combine | 2025-03-18 15:41:50 +08:00 |
| Chenggang Zhao | 82dcf48fd3 | Fix bugs for intranode EP kernels | 2025-03-14 16:09:23 +08:00 |
| Shangyan Zhou | 38cdaf390c | Fix style. | 2025-03-14 11:22:00 +08:00 |
| Shangyan Zhou | 2d0cf41dd1 | Low latency kernels use rdma atomic to support AR. | 2025-03-14 11:04:57 +08:00 |
| Chenggang Zhao | ed7487c15e | Support BF16 for low-latency kernels | 2025-03-10 17:24:41 +08:00 |
| Chenggang Zhao | 1fc40d50f3 | Improve AR performance | 2025-03-06 21:41:19 +08:00 |
| Chenggang Zhao | 458cdcb22a | Fix AR bugs for normal kernels | 2025-03-05 17:13:35 +08:00 |
| Chenggang Zhao | 680e424bdc | Bugs fixed | 2025-03-05 14:27:45 +08:00 |
| Chenggang Zhao | 1553fc42bf | Improve EP2/4 performance | 2025-03-04 15:34:33 +08:00 |
| Chenggang Zhao | 6cc3497df8 | Remove all raw tensors for better P2P overlapping | 2025-03-03 14:25:22 +08:00 |
| Chenggang Zhao | 77bb07aa20 | Update some comments and docs | 2025-02-27 10:27:22 +08:00 |
| Chenggang Zhao | ebfe47e46f | Initial commit | 2025-02-25 09:07:53 +08:00 |