29 Commits

Author SHA1 Message Date
zhouzihan30
8997d13745 Performance: Async copy for q in advance, roughly 0.5% performance gain
Hi, I think we can optimize how q is loaded from gmem to smem.
In compute_attn_1rowblock_splitkv_mla, q is no longer needed after the
last q @ k_T computation, so we can issue an async copy at that point
and load the next q into smem in advance, before the next call to
compute_attn_1rowblock_splitkv_mla runs.

To prevent valid values in smem from being overwritten, I adjusted the
layout of SharedStorageMLA and tested the change with test_flash_mla.py.
The test passes without any calculation errors.

I tested on an H800. Each reported number is the average of 10 runs,
with a 3-second interval between runs to stabilize the GPU frequency.
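
The measurement procedure described above (average of 10 runs with a
pause between runs) could be sketched roughly as follows. This is only an
illustration, not the script used for this commit: the function name
averaged_measurement and the measure callable are hypothetical, and the
real benchmark would call into the kernel under test.

```python
import statistics
import time
from typing import Callable


def averaged_measurement(
    measure: Callable[[], float],
    runs: int = 10,
    interval_s: float = 3.0,
) -> float:
    """Call `measure` `runs` times, sleeping `interval_s` seconds between
    calls, and return the arithmetic mean of the samples.

    `measure` is a hypothetical callable that runs one benchmark and
    returns a single number (for example, achieved bandwidth).
    """
    samples = []
    for i in range(runs):
        if i > 0:
            # Pause between runs so the GPU clock can settle.
            time.sleep(interval_s)
        samples.append(measure())
    return statistics.mean(samples)
```

With runs=10 and interval_s=3.0 this matches the setup described in the
message; a shorter interval is convenient when trying the helper out.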

q is loaded only a few times, so this does not bring much performance
improvement. For some parameters there is a slight decrease in
performance, but overall there is a roughly 0.5% improvement.

batch,seqlen,head,bw_orig,bw_opt,bw_diff_percentage
64,1087,128,1384,1407,1.66%
64,2111,128,1744,1761,0.97%
64,4159,128,2188,2197,0.41%
64,8255,128,2341,2345,0.17%
64,16447,128,2330,2338,0.34%
64,32831,128,2374,2374,0.0%
128,1151,128,1756,1763,0.4%
128,2175,128,2066,2072,0.29%
128,4223,128,2284,2290,0.26%
128,8319,128,2343,2349,0.26%
128,16511,128,2375,2373,-0.08%
128,32895,128,2351,2358,0.3%
256,1279,128,2033,2035,0.1%
256,2303,128,2232,2228,-0.18%
256,4351,128,2322,2340,0.78%
256,8447,128,2371,2367,-0.17%
256,16639,128,2359,2394,1.48%
256,33023,128,2381,2392,0.46%
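
For readers who want to re-derive the last column, here is a small sketch
that recomputes the percentage difference from bw_orig and bw_opt and
takes a plain mean over the rows. It assumes bw_diff_percentage means
(bw_opt - bw_orig) / bw_orig * 100, which is consistent with the values
in the table; the helper name percent_gain is hypothetical, and the units
of the bandwidth columns are not stated in the message.

```python
import csv
import io

# Benchmark rows copied from the table above (last column omitted,
# since it is recomputed below).
TABLE = """\
batch,seqlen,head,bw_orig,bw_opt
64,1087,128,1384,1407
64,2111,128,1744,1761
64,4159,128,2188,2197
64,8255,128,2341,2345
64,16447,128,2330,2338
64,32831,128,2374,2374
128,1151,128,1756,1763
128,2175,128,2066,2072
128,4223,128,2284,2290
128,8319,128,2343,2349
128,16511,128,2375,2373
128,32895,128,2351,2358
256,1279,128,2033,2035
256,2303,128,2232,2228
256,4351,128,2322,2340
256,8447,128,2371,2367
256,16639,128,2359,2394
256,33023,128,2381,2392
"""


def percent_gain(bw_orig: float, bw_opt: float) -> float:
    """Relative change of bw_opt over bw_orig, in percent."""
    return (bw_opt - bw_orig) / bw_orig * 100.0


rows = list(csv.DictReader(io.StringIO(TABLE.strip())))
gains = [percent_gain(float(r["bw_orig"]), float(r["bw_opt"])) for r in rows]
mean_gain = sum(gains) / len(gains)

for r, g in zip(rows, gains):
    print(f'{r["batch"]:>4} {r["seqlen"]:>6} {g:+.2f}%')
print(f"unweighted mean: {mean_gain:+.2f}%")
```

The unweighted mean over these 18 rows comes out at about 0.4%, in line
with the "roughly 0.5%" figure quoted above; three rows show a small
regression.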

Thanks!
2025-03-26 14:20:28 +08:00
ljss
b31bfe72a8 add missing copyright 2025-03-01 18:24:24 +08:00
Jiashi Li
3e123bc93c
add community support for [AMD] 2025-03-01 17:55:58 +08:00
hpp
1aef31d163 reformat Community Support section 2025-02-27 09:42:09 +08:00
hpp
77d9d8d21b add Community Support of [Hygon DCU] [Intellifusion] [Iluvatar Corex] 2025-02-27 09:40:47 +08:00
hpp
4430e398d9 add Community Support of [Hygon DCU] [Intellifusion] [Iluvatar Corex] 2025-02-27 09:39:18 +08:00
Jiashi Li
480405ada9
fix readme 2025-02-26 20:32:39 +08:00
Jiashi Li
966eedc2f7
Fix readme 2025-02-26 20:30:45 +08:00
Jiashi Li
01d6d40062
Merge pull request #45 from yangsijia-serena/main
fix(benchmark): store 'compare' and 'one' perf results in csv files and visualize them
2025-02-26 20:14:40 +08:00
hpp
6492cabb28 add Community Support of [MetaX] and [Moore Threads] 2025-02-26 11:26:42 +08:00
yangsijia.614
b67980309b fix(benchmark): store 'compare' and 'one' perf results in csv files and visualize them 2025-02-26 00:14:51 +08:00
ljss
4edea86f9e cuda12.8 recommendation 2025-02-26 00:05:57 +08:00
Jiashi Li
b549289fb4
Merge pull request #32 from sijiac/fp16-support
Support FP16 dtype in FlashMLA kernel
2025-02-25 09:19:42 +08:00
ljss
e1e9fa98f8 Style fix 2025-02-25 09:18:11 +08:00
Sijia Chen
a3b74b8574 add flag to disable FP16 compile 2025-02-24 10:01:59 -08:00
Jiashi Li
18e32770cc
Merge pull request #35 from KnowingNothing/main
feat: add benchmark for flash_infer vs flash_mla
2025-02-25 00:41:23 +08:00
Jiashi Li
7d69520ad4
Merge pull request #37 from chunyang-wen/Update-doc-string
Update docstring
2025-02-25 00:38:31 +08:00
zhengsize
922f63bdaa add gitignore for png and csv files in benchmark 2025-02-25 00:38:02 +08:00
chunyang.wen
c4c5912b05 Update docstring 2025-02-25 00:11:57 +08:00
zhengsize
4da4dbd303 feat: add benchmark for flash_infer vs flash_mla 2025-02-24 22:34:22 +08:00
Sijia Chen
65fb7732fc support fp16 2025-02-24 01:58:53 -08:00
Sijia Chen
15a82b81b8 replace c10 optional with std optional 2025-02-24 00:25:40 -08:00
Jiashi Li
bcb90f2afd
Merge pull request #9 from homorunner/main
support Windows build
2025-02-24 13:21:58 +08:00
Jiashi Li
dd1161e396
Merge pull request #14 from lancerts/minor-fix
minor fix test
2025-02-24 13:13:58 +08:00
lancerts
4fbaa9527c minor fix test 2025-02-23 20:12:49 -08:00
Jiashi Li
accc1695ee
Merge pull request #12 from sazczmh/main
tests: Triton 3.2.0 has removed the fast_flush parameter from do_bench
2025-02-24 11:57:41 +08:00
程元
e62bdb4d3f support Windows build 2025-02-24 11:29:36 +08:00
sazc
051e40e82b tests: Triton has removed the fast_flush parameter from do_bench (#4485) 2025-02-24 10:59:22 +08:00
Jiashi Li
414a2f3eed Initial commit
2025-02-24 09:20:23 +08:00