44 Commits

Author SHA1 Message Date
ademeure
6cbff5778f Correctly flush L2, as reconstructing the tensors on every iteration effectively put them in the L2, and gave the GPU enough idle time to avoid thermal throttling in a potentially unrealistic way.
The previous behaviour is potentially representative of some use cases (e.g. previous kernel filling L2 with the data in a very specific way) but not standard benchmarking practice.
2025-03-15 20:46:24 +00:00
Liang
e1c070fbef
Merge pull request #65 from Z-NAVY/main
Fix get_col_major_tma_aligned_tensor to handle 2-dimensional inputs
2025-03-14 13:50:08 +08:00
Liang
4377c4dc57
Merge pull request #63 from fzyzcjy/patch-2
Super tiny fix typo
2025-03-14 10:27:48 +08:00
z-navy
3f92607b98 Fix get_col_major_tma_aligned_tensor to handle 2-dimensional inputs 2025-03-13 22:15:16 +08:00
fzyzcjy
e7fff7ef0a
Update m_grouped_gemm.py 2025-03-13 22:09:15 +08:00
Chenggang Zhao
bd2a775528 Code format 2025-03-11 13:26:10 +08:00
Chenggang Zhao
5233bad1e9
Merge pull request #55 from sleepcoo/fix-cudagraph
fix cuda_graph rng check error
2025-03-11 13:25:35 +08:00
sleepcoo
723a00338e fix cuda_graph rng check error 2025-03-11 12:40:42 +08:00
Chenggang Zhao
5e4badc577 Fix type lint 2025-03-10 13:10:16 +08:00
Chenggang Zhao
ba1e93a5c7
Merge pull request #44 from sazczmh/main
Performance: Configuration algorithms tuned to minimize the impact of tail effects, now up to 1402 TFLOPS
2025-03-10 13:08:03 +08:00
sazc
bed67b234c Minor fix 2025-03-10 13:02:02 +08:00
sazc
ed278eddd3 formats: Optimize get_best_configs implementation 2025-03-10 12:56:14 +08:00
sazc
50cf26cc7c Performance: Configuration algorithms tuned to minimize the impact of tail effects, now up to 1402 TFLOPS 2025-03-10 11:45:05 +08:00
Chenggang Zhao
39c10e6c31 Revert "Merge pull request #49 from A-transformer/maximum_fp8_e4m3_value"
This reverts commit 4d4f2342fe, reversing
changes made to 9d3222a93e.
2025-03-10 09:47:02 +08:00
Liang
4d4f2342fe
Merge pull request #49 from A-transformer/maximum_fp8_e4m3_value
Maximum representable value in FP8 E4M3 format
2025-03-08 09:50:06 +08:00
A-transformer
629857685e
Maximum representable value in FP8 E4M3 format
Replace Hardcoded 448.0 with Global Constant FP8_E4M3_MAX for FP8 E4M3 Format
2025-03-07 19:58:02 +04:00
Liang
9d3222a93e
Merge pull request #42 from sazczmh/main
Performance: reducing the percentage of FFMA interleaving yields a slight performance gain, roughly 0.5%
2025-03-05 20:09:56 +08:00
sazc
fcd1dcd99d Performance: reducing the percentage of FFMA interleaving yields a slight performance gain, roughly 0.5% 2025-03-05 17:50:22 +08:00
Chenggang Zhao
68fc742572
Merge pull request #36 from yz-tang/fix_setup_build
Fix setup build error when setuptools version is lower
2025-03-04 17:09:50 +08:00
yz-tang
6c59e0f40d fix setup build error when setuptools version is lower 2025-03-04 16:53:00 +08:00
Chenggang Zhao
9b0dad8640 Add some notes for promotion 2025-03-04 11:42:20 +08:00
Liang
ded740f736
Fix documentation of m_grouped_gemm_fp8_fp8_bf16_nt_contiguous in m_grouped_gemm.py 2025-03-04 11:26:23 +08:00
Chenggang Zhao
dff6bb6f0b Add some notes 2025-03-03 11:35:52 +08:00
Chenggang Zhao
6c5da03ba9 Support more shapes 2025-02-28 10:04:59 +08:00
Chenggang Zhao
b69f630b91 Minor fix util function 2025-02-28 09:46:38 +08:00
Chenggang Zhao
6e10cba207 Minor fix 2025-02-28 09:21:35 +08:00
Liang
fbec9e5eee
Update get_best_configs
a better strategy to choose config
2025-02-27 23:18:52 +08:00
Zhean Xu
461427ecd0
Merge pull request #27 from vatlor/main
fix typo
2025-02-27 20:37:31 +08:00
dotrail
488b5fc467 fix typo 2025-02-27 11:53:33 +00:00
Chenggang Zhao
6da94d2d36 Add extra TMA checks 2025-02-27 18:20:57 +08:00
Chenggang Zhao
ca13ce0fab Fix TMA store bugs and code format 2025-02-27 17:57:21 +08:00
Chenggang Zhao
b05ed2f017 Code format 2025-02-27 10:50:20 +08:00
Chenggang Zhao
676329b8e2
Merge pull request #19 from dzhulgakov/fix-wheel
Fix wheel building
2025-02-27 10:44:11 +08:00
Chenggang Zhao
6e55da296f Fix python -O mode issues 2025-02-27 10:42:46 +08:00
Chenggang Zhao
d5b974da2b
Merge pull request #16 from AcraeaTerpsicore/patch-1
Fix typos
2025-02-27 10:34:12 +08:00
Dmytro Dzhulgakov
fc7c3f8299 setup.py: fix wheel building 2025-02-26 17:48:57 +00:00
Zhean Xu
78cacf70d4
Update README.md 2025-02-26 19:20:39 +08:00
AcraeaTerpsicore
96b31fd6bb
fix typo 2025-02-26 18:37:22 +08:00
xuzhean
bc989405fe fix: prevent expected_m from exceeding m in test_core 2025-02-26 16:55:47 +08:00
Zhean Xu
eec7ab7f03
Merge pull request #13 from ZeppLu/patch-1
doc: Use permanent link
2025-02-26 16:34:23 +08:00
Zepp
7a70b439cd
doc: Use permanent link 2025-02-26 16:15:37 +08:00
Chenggang Zhao
184ce9b5ea
Merge pull request #3 from acheong08/patch-1
Spelling: README.md
2025-02-26 13:29:45 +08:00
Antonio Cheong
5da24e229a
spelling: README.md
behavior -> behaves
2025-02-26 02:36:04 +00:00
Chenggang Zhao
a6d97a1c1b Initial commit 2025-02-25 22:52:41 +08:00