From d626421fff08a24a35503aab3f1b2f30d41ff513 Mon Sep 17 00:00:00 2001
From: zihan zhou <15645113830zzh@gmail.com>
Date: Wed, 19 Mar 2025 11:10:09 +0800
Subject: [PATCH] Define UNFUSE_FMA to avoid the FFMA rounding bug

The code comments in scale_apply_exp2 already mention this issue:
https://github.com/pytorch/pytorch/issues/121558

The issue is that, in flash attention, the fused ffma instruction gives
different results from a separate fmul followed by fadd.

For fadd and fmul, the calculation is:
round_fp32(x_i * scale) - round_fp32(max(x) * scale)
For x_i = max(x), this value is exactly 0.

But for ffma, the calculation is:
x_i * scale - round_fp32(max(x) * scale)
For x_i = max(x), this is the rounding error of x_i * scale, which is
generally nonzero. So although ffma is the more accurate operation on
its own, the exponent at the row maximum is no longer exactly 0.
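
A minimal sketch of the rounding difference, assuming ffma rounds only once
after an exact multiply-add and emulating binary32 in plain Python.
round_fp32, unfused, and fused are illustrative helpers, and the inputs are
chosen for the demonstration, not taken from the kernel or the test:

```python
import struct


def round_fp32(v: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def unfused(x: float, max_x: float, scale: float) -> float:
    # fmul then fadd: each product is rounded to binary32 before subtracting.
    return round_fp32(round_fp32(x * scale) - round_fp32(max_x * scale))


def fused(x: float, max_x: float, scale: float) -> float:
    # ffma emulation: x * scale stays exact, only the final result is rounded.
    # The product of two binary32 values is exact in binary64 (48 bits < 53).
    # For x == max_x the difference is also exact in binary64, so this matches
    # a single-rounding fused multiply-add in that case.
    max_scaled = round_fp32(max_x * scale)
    return round_fp32(x * scale - max_scaled)


# Illustrative binary32 inputs; x plays the role of max(x).
x = scale = 1.0 + 2.0 ** -23
print(unfused(x, x, scale))  # 0.0
print(fused(x, x, scale))    # 2**-46, about 1.42e-14
```

The residual grows with the magnitude of x * scale, because the spacing of
binary32 values grows with the exponent.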

The issue can be reproduced by changing the initialization of q and
blocked_k; the final outputs are then all 0:

q = torch.full((b, s_q, h_q, d), 133120.0)
blocked_k = torch.full((block_table.numel(), block_size, h_kv, d), 133120.0)

Defining UNFUSE_FMA alleviates the problem, but the cal-diff test still
fails. I am not sure whether the remaining failure is an accuracy issue,
but I think the fma bug needs to be fixed first.
---
 setup.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/setup.py b/setup.py
index cd311f2..90affd1 100644
--- a/setup.py
+++ b/setup.py
@@ -65,6 +65,7 @@ ext_modules.append(
                     "-std=c++17",
                     "-DNDEBUG",
                     "-D_USE_MATH_DEFINES",
+                    "-DUNFUSE_FMA",
                     "-Wno-deprecated-declarations",
                     "-U__CUDA_NO_HALF_OPERATORS__",
                     "-U__CUDA_NO_HALF_CONVERSIONS__",