Performance Update (2025.04.22) (#71)

* Fix benchmark script

* Performance optimization for compute-bound cases

* Add new testcase (s_k = 16384)

* Update README.md

* Update comment

* Update README.md

* Add the deep-dive blog

* Add background color for MLA Kernel Sched.drawio.svg

* Use relative path for the schedule image

* Move flash_mla.h to kernels/params.h
Shengyu Liu
2025-04-22 17:50:57 +08:00
committed by GitHub
parent b31bfe72a8
commit c2067be3ea
25 changed files with 2757 additions and 1228 deletions
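
The "Add new testcase (s_k = 16384)" item widens the benchmark sweep shown in the hunk below. As a rough sketch of what that does to the number of benchmarked configurations, the loop values can be enumerated with `itertools.product`. The constant names and `benchmark_grid` are hypothetical, chosen for illustration; only the list values and the TP/MTP labels come from the diff, and reading `s` as the KV sequence length `s_k` is an assumption based on the commit message.

```python
from itertools import product

# Values copied from the post-change loops in main(); the names are
# illustrative only and do not appear in the benchmark script.
BATCH_SIZES = [128]
KV_SEQ_LENS = [4096, 8192, 16384]  # 16384 is the case added by this commit
Q_HEADS = [16, 32, 64, 128]        # labelled "TP = 8, 4, 2, 1" in the source
Q_SEQ_LENS = [1, 2]                # labelled "MTP = 1, 2" in the source
VARLEN = [False, True]


def benchmark_grid():
    """Return every (b, s, h_q, s_q, varlen) tuple in nested-loop order."""
    return list(product(BATCH_SIZES, KV_SEQ_LENS, Q_HEADS, Q_SEQ_LENS, VARLEN))


if __name__ == "__main__":
    grid = benchmark_grid()
    # 1 * 3 * 4 * 2 * 2 combinations after the change (it was 1 * 2 * 4 * 2 * 2).
    print(len(grid))
    new_cases = [cfg for cfg in grid if cfg[1] == 16384]
    print(len(new_cases))
```

Under these assumptions the sweep grows from 32 to 48 configurations, with 16 of them using the new 16384 length.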

@@ -127,7 +127,7 @@ def main(torch_dtype):
     causal = True
     for b in [128]:
-        for s in [4096, 8192]:
+        for s in [4096, 8192, 16384]:
             for h_q in [16, 32, 64, 128]: # TP = 8, 4, 2, 1
                 for s_q in [1, 2]: # MTP = 1, 2
                     for varlen in [False, True]: