12 Commits

Author SHA1 Message Date
Chenggang Zhao
8041ed7164 Use 1D TMA store 2025-04-11 10:42:01 +08:00
Chenggang Zhao
b4ecf9c3ff Fix TMA multicast bugs 2025-04-07 14:34:42 +08:00
Chenggang Zhao
6db7e1863b Solve STSM bank conflict via padding and 3D TMA 2025-04-03 15:39:35 +08:00
Chenggang Zhao
046fab64b7 Fix grouped GEMM cases 2025-03-25 16:41:44 +08:00
Chenggang Zhao
b922e64cb2 Support block size 160 2025-03-25 13:37:59 +08:00
sazc
46eb0d08fb Performance: Larger BlockTile optimizations enable 1470+ TFLOPS FP8 performance on the H800-SXM platform 2025-03-25 10:44:57 +08:00
fzyzcjy
e7fff7ef0a Update m_grouped_gemm.py 2025-03-13 22:09:15 +08:00
sazc
ed278eddd3 formats: Optimize get_best_configs implementation 2025-03-10 12:56:14 +08:00
sazc
50cf26cc7c Performance: Configuration algorithms tuned to minimize the impact of tail effects, now up to 1402 TFLOPS 2025-03-10 11:45:05 +08:00
Liang
ded740f736 Fix documentation of m_grouped_gemm_fp8_fp8_bf16_nt_contiguous in m_grouped_gemm.py 2025-03-04 11:26:23 +08:00
Chenggang Zhao
6da94d2d36 Add extra TMA checks 2025-02-27 18:20:57 +08:00
Chenggang Zhao
a6d97a1c1b Initial commit 2025-02-25 22:52:41 +08:00