Commit Graph

6 Commits

Author SHA1 Message Date
zihan zhou
d626421fff Fix the fma bug
Hi, I found an issue in scale_apply_exp2. The code comments also mention it:
https://github.com/pytorch/pytorch/issues/121558

The issue is that the ffma instruction produces results in flash attention
that differ from a separated fmul followed by fadd.

For separate fmul and fadd, the value at the row maximum (x_i = max(x)) is computed as:
round_fp32(x_i * scale) - round_fp32(x_i * scale)
which is exactly 0.

But for ffma, the product is not rounded to fp32 before the add, so the same value is computed as:
x_i * scale - round_fp32(x_i * scale)
which is the fp32 rounding error of x_i * scale and is generally nonzero.
Although ffma is actually more accurate, its results differ from the separated
computation, so the values contain errors relative to the expected ones.
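
The effect can be emulated in PyTorch (a minimal sketch; the 133120.0 value and
the 1/sqrt(128) scale are illustrative assumptions, and float64 stands in for
the unrounded ffma product):

import torch

x = torch.tensor(133120.0, dtype=torch.float32)   # element equal to the row maximum
scale = 1.0 / torch.tensor(128.0).sqrt()          # hypothetical softmax scale

# Separated fmul + fadd: the product is rounded to fp32 before the subtraction,
# so both terms are identical and the difference is exactly 0.
separated = (x * scale) - (x * scale)

# ffma keeps x * scale unrounded before the add; emulate that with float64.
# The result is the fp32 rounding error of x * scale, generally nonzero.
fused = x.double() * scale.double() - (x * scale).double()

print(separated.item(), fused.item())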

We can reproduce the issue by changing the initialization values of q and k,
after which the final outputs are all 0:

q = torch.full((b, s_q, h_q, d), 133120.0)
blocked_k = torch.full((block_table.numel(), block_size, h_kv, d), 133120.0)

If we define UNFUSE_FMA, the problem is alleviated, but the kernel still
cannot pass the cal-diff test. I am not sure whether the remaining failure is
an accuracy issue, but I think it is necessary to fix the fma bug first.
2025-03-19 11:11:05 +08:00
ljss
4edea86f9e CUDA 12.8 recommendation 2025-02-26 00:05:57 +08:00
Sijia Chen
a3b74b8574 add flag to disable FP16 compile 2025-02-24 10:01:59 -08:00
Sijia Chen
65fb7732fc support fp16 2025-02-24 01:58:53 -08:00
程元
e62bdb4d3f support Windows build 2025-02-24 11:29:36 +08:00
Jiashi Li
414a2f3eed Initial commit
2025-02-24 09:20:23 +08:00