Author | Commit | Message | Date
Chenggang Zhao | 046fab64b7 | Fix grouped GEMM cases | 2025-03-25 16:41:44 +08:00
Chenggang Zhao | b922e64cb2 | Support block size 160 | 2025-03-25 13:37:59 +08:00
sazc | 46eb0d08fb | Performance: Larger BlockTile optimizations enable 1470+ TFLOPS FP8 performance on the H800-SXM platform | 2025-03-25 10:44:57 +08:00
fzyzcjy | e7fff7ef0a | Update m_grouped_gemm.py | 2025-03-13 22:09:15 +08:00
sazc | ed278eddd3 | formats: Optimize get_best_configs implementation | 2025-03-10 12:56:14 +08:00
sazc | 50cf26cc7c | Performance: Configuration algorithms tuned to minimize the impact of tail effects, now up to 1402 TFLOPS | 2025-03-10 11:45:05 +08:00
Liang | ded740f736 | Fix documentation of m_grouped_gemm_fp8_fp8_bf16_nt_contiguous in m_grouped_gemm.py | 2025-03-04 11:26:23 +08:00
Chenggang Zhao | 6da94d2d36 | Add extra TMA checks | 2025-02-27 18:20:57 +08:00
Chenggang Zhao | a6d97a1c1b | Initial commit | 2025-02-25 22:52:41 +08:00