FlashMLA

mirror of https://github.com/deepseek-ai/FlashMLA synced 2025-06-26 18:15:54 +00:00

Author	SHA1	Message	Date
zihan zhou	d626421fff	Fix the FMA bug. In scale_apply_exp2, the code comments already mention this issue: https://github.com/pytorch/pytorch/issues/121558 The ffma instruction introduces calculation errors in flash attention compared with separate fadd and fmul. With fadd and fmul, the calculation is: round_fp32(x_i * scale) - round_fp32(max(x) * scale) For x_i = max(x), this value is 0. With ffma, the calculation is: x_i * scale - round_fp32(max(x) * scale) Although ffma is more accurate for the individual operation, the resulting values are wrong. The issue can be reproduced by changing the initialization values of q and k, after which the final outputs are all 0: q = torch.full((b, s_q, h_q, d), 133120.0) blocked_k = torch.full((block_table.numel(), block_size, h_kv, d), 133120.0) Defining UNFUSE_FMA alleviates the problem, but the cal-diff test still fails. I am not sure whether that is an accuracy issue, but I think the FMA bug needs to be fixed first.	2025-03-19 11:11:05 +08:00
ljss	b31bfe72a8	add missing copyright	2025-03-01 18:24:24 +08:00
Jiashi Li	3e123bc93c	add community support for [AMD]	2025-03-01 17:55:58 +08:00
hpp	1aef31d163	reformat Community Support section	2025-02-27 09:42:09 +08:00
hpp	77d9d8d21b	add Community Support of [Hygon DCU] [Intellifusion] [Iluvatar Corex]	2025-02-27 09:40:47 +08:00
hpp	4430e398d9	add Community Support of [Hygon DCU] [Intellifusion] [Iluvatar Corex]	2025-02-27 09:39:18 +08:00
Jiashi Li	480405ada9	fix readme	2025-02-26 20:32:39 +08:00
Jiashi Li	966eedc2f7	Fix readme	2025-02-26 20:30:45 +08:00
Jiashi Li	01d6d40062	Merge pull request #45 from yangsijia-serena/main fix(benchmark): store 'compare' and 'one' perf results in csv files and visualize them	2025-02-26 20:14:40 +08:00
hpp	6492cabb28	add Community Support of [MetaX] and [Moore Threads]	2025-02-26 11:26:42 +08:00
yangsijia.614	b67980309b	fix(benchmark): store 'compare' and 'one' perf results in csv files and visualize them	2025-02-26 00:14:51 +08:00
ljss	4edea86f9e	cuda12.8 recommendation	2025-02-26 00:05:57 +08:00
Jiashi Li	b549289fb4	Merge pull request #32 from sijiac/fp16-support Support FP16 dtype in FlashMLA kernel	2025-02-25 09:19:42 +08:00
ljss	e1e9fa98f8	Style fix	2025-02-25 09:18:11 +08:00
Sijia Chen	a3b74b8574	add flag to disable FP16 compile	2025-02-24 10:01:59 -08:00
Jiashi Li	18e32770cc	Merge pull request #35 from KnowingNothing/main feat: add benchmark for flash_infer vs flash_mla	2025-02-25 00:41:23 +08:00
Jiashi Li	7d69520ad4	Merge pull request #37 from chunyang-wen/Update-doc-string Update docstring	2025-02-25 00:38:31 +08:00
zhengsize	922f63bdaa	add gitignore for png and csv files in benchmark	2025-02-25 00:38:02 +08:00
chunyang.wen	c4c5912b05	Update docstring	2025-02-25 00:11:57 +08:00
zhengsize	4da4dbd303	feat: add benchmark for flash_infer vs flash_mla	2025-02-24 22:34:22 +08:00
Sijia Chen	65fb7732fc	support fp16	2025-02-24 01:58:53 -08:00
Sijia Chen	15a82b81b8	replace c10 optional with std optional	2025-02-24 00:25:40 -08:00
Jiashi Li	bcb90f2afd	Merge pull request #9 from homorunner/main support Windows build	2025-02-24 13:21:58 +08:00
Jiashi Li	dd1161e396	Merge pull request #14 from lancerts/minor-fix minor fix test	2025-02-24 13:13:58 +08:00
lancerts	4fbaa9527c	minor fix test	2025-02-23 20:12:49 -08:00
Jiashi Li	accc1695ee	Merge pull request #12 from sazczmh/main tests: Triton 3.2.0 removed the fast_flush parameter from do_bench	2025-02-24 11:57:41 +08:00
程元	e62bdb4d3f	support Windows build	2025-02-24 11:29:36 +08:00
sazc	051e40e82b	tests: Triton removed the fast_flush parameter from do_bench (#4485)	2025-02-24 10:59:22 +08:00
Jiashi Li	414a2f3eed	Initial commit i	2025-02-24 09:20:23 +08:00

29 Commits
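
The rounding mismatch described in commit d626421fff can be sketched on the CPU without a GPU. This is a minimal illustration, not the kernel code: the helper names (round_fp32, unfused_sub, fused_sub) are hypothetical, the ffma is emulated in binary64 (which holds the product of two fp32 values exactly), and the inputs are chosen for illustration rather than taken from the commit.

```python
import struct


def round_fp32(v: float) -> float:
    """Round a binary64 Python float to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def unfused_sub(x: float, scale: float, max_scaled: float) -> float:
    # Separate fmul + fadd: the product is rounded to fp32 before subtracting.
    return round_fp32(round_fp32(x * scale) - max_scaled)


def fused_sub(x: float, scale: float, max_scaled: float) -> float:
    # Emulated ffma: the product stays exact and only the final result is rounded.
    # binary64 represents the product of two fp32 values exactly (48 significant bits).
    return round_fp32(x * scale - max_scaled)


# Illustrative fp32-representable inputs (assumptions, not values from the commit).
x_max = float(2**30 + 2**7)   # the row maximum
scale = 1.5 + 2.0**-23        # stands in for the softmax scale factor

# The scaled maximum is computed with a plain multiply, so it is rounded to fp32.
max_scaled = round_fp32(x_max * scale)

unfused = unfused_sub(x_max, scale, max_scaled)
fused = fused_sub(x_max, scale, max_scaled)

# For the maximum element the exponent should be 0, so exp2 should be 1.
print("unfused exponent:", unfused, "-> exp2 =", 2.0**unfused)
print("fused exponent:  ", fused, "-> exp2 =", 2.0**fused)
```

With these inputs the unfused path yields an exponent of exactly 0, while the emulated fused path yields about -64, so exp2 of the maximum element collapses toward zero instead of 1. The residual is bounded by half an fp32 ULP of the scaled score, which grows with the score's magnitude; that is consistent with the commit's observation that very large q and k values drive the outputs to 0.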