Commit Graph

  • 9edee0c022 update .gitignore main ljss 2025-04-29 12:03:15 +0800
  • 9c5dfab6d1 update to cutlass 3.9 ljss 2025-04-29 12:02:57 +0800
  • 01a27728e6 Fix synchronization issues ljss 2025-04-28 18:53:04 +0800
  • 70b9468520 Fix LaTeX render error (#74) Shengyu Liu 2025-04-23 10:21:14 +0800
  • 69d6df34e5 Fix LaTeX render error Shengyu Liu 2025-04-23 10:19:21 +0800
  • 6cff5a73f5 Minor fix to the docs to correct FlashAttention-3's paper link and typos (#73) ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 2025-04-23 05:14:05 +0300
  • 667cf59636 Minor fix to the docs to correct FlashAttention-3's paper link and typos Hollow Man 2025-04-22 16:46:15 +0300
  • a9444cd67d Update README.md (#72) Shengyu Liu 2025-04-22 18:03:14 +0800
  • 60438c1ee4 Update README.md Shengyu Liu 2025-04-22 18:01:09 +0800
  • c2067be3ea Performance Update (2025.04.22) (#71) Shengyu Liu 2025-04-22 17:50:57 +0800
  • 828a19c720 Move flash_mla.h to kernels/params.h Shengyu Liu 2025-04-22 17:46:35 +0800
  • c7123cb36e Use relative path for the schedule image Shengyu Liu 2025-04-22 17:34:05 +0800
  • 65451d68f8 Add background color for MLA Kernel Sched.drawio.svg Shengyu Liu 2025-04-22 17:10:46 +0800
  • 984059a7cf Add the deep-dive blog Shengyu Liu 2025-04-22 17:05:18 +0800
  • c7996e951d Update README.md Shengyu Liu 2025-04-22 17:01:49 +0800
  • 15f3897667 Update comment Shengyu Liu 2025-04-22 16:46:48 +0800
  • 9352b7a790 Update README.md Shengyu Liu 2025-04-21 17:23:26 +0800
  • 69b64823cc Add new testcase (s_k = 16384) Shengyu Liu 2025-04-21 17:23:19 +0800
  • 287061ec34 Performance optimization for compute-bound cases Shengyu Liu 2025-04-21 17:22:59 +0800
  • 063ffa8ec1 Fix benchmark script Shengyu Liu 2025-04-21 15:42:12 +0800
  • 8997d13745 Performance: Async copy for q in advance, roughly 0.5% performance gain zhouzihan30 2025-03-21 16:17:08 +0800
  • d626421fff Fix the bug of fma zihan zhou 2025-03-19 11:10:09 +0800
  • 927eebc10f fix: Update named barrier thread count to match actual participating threads IshanaSabrish 2025-03-01 21:18:05 +0530
  • b31bfe72a8 add missing copyright ljss 2025-03-01 18:24:24 +0800
  • 3e123bc93c add community support for [AMD] Jiashi Li 2025-03-01 17:55:58 +0800
  • 7fafcd217d add env chenhongmin.will 2025-03-01 14:44:25 +0800
  • 6199b0b4b5 update desc chenhongmin.will 2025-03-01 07:53:04 +0800
  • 90289837fc update ut chenhongmin.will 2025-03-01 02:14:42 +0800
  • 6684be4371 Update README.md for AMD MLA Peng 2025-02-28 23:29:32 +0800
  • 9887a5501e update readme chenhongmin.will 2025-02-28 22:18:04 +0800
  • c7143a7bda Merge branch 'main' into will_fp8_mr chenhongmin.will 2025-02-28 22:07:03 +0800
  • 8b939854d8 enable scale chenhongmin.will 2025-02-28 19:43:24 +0800
  • 4e055a6142 reorg ut chenhongmin.will 2025-02-28 18:59:02 +0800
  • bfe38ab106 fix combine chenhongmin.will 2025-02-28 18:45:09 +0800
  • fd1e662deb fix mma0 chenhongmin.will 2025-02-28 16:52:30 +0800
  • 061af5fc56 use fa'3 transv chenhongmin.will 2025-02-28 14:35:07 +0800
  • 0337732dc1 reorg chenhongmin.will 2025-02-28 08:09:02 +0800
  • 1df91aff33 fix compile chenhongmin.will 2025-02-27 23:40:02 +0800
  • 855c985b00 use 64x64 transpose_v chenhongmin.will 2025-02-27 22:28:45 +0800
  • d1689ab64f use mm1's Aregs instead of mma0's Cregs chenhongmin.will 2025-02-27 10:56:43 +0800
  • 1aef31d163 reformat Community Support section hpp 2025-02-27 09:42:09 +0800
  • 77d9d8d21b add Community Support of [Hygon DCU] [Intellifusion] [Iluvatar Corex] hpp 2025-02-27 09:40:47 +0800
  • 4430e398d9 add Community Support of [Hygon DCU] [Intellifusion] [Iluvatar Corex] hpp 2025-02-27 09:39:18 +0800
  • 1757a6db07 try fix chenhongmin.will 2025-02-27 09:11:17 +0800
  • dbd8c307eb fix sV chenhongmin.will 2025-02-26 22:06:04 +0800
  • 480405ada9 fix readme Jiashi Li 2025-02-26 20:32:39 +0800
  • 966eedc2f7 Fix readme Jiashi Li 2025-02-26 20:30:45 +0800
  • 01d6d40062 Merge pull request #45 from yangsijia-serena/main Jiashi Li 2025-02-26 20:14:40 +0800
  • 6dcea4952c add TransV chenhongmin.will 2025-02-26 18:37:36 +0800
  • 6a4eb631e2 add transv barrier chenhongmin.will 2025-02-26 17:57:00 +0800
  • 59f691763e fix Vt illegal chenhongmin.will 2025-02-26 17:25:52 +0800
  • 29de9e0c79 debug mode chenhongmin.will 2025-02-26 16:03:17 +0800
  • 6492cabb28 add Community Support of [MetaX] and [Moore Threads] hpp 2025-02-26 11:26:42 +0800
  • f6fab1b915 change to use per_tensor chenhongmin.will 2025-02-26 10:17:29 +0800
  • 4b314cd655 update fp8 api chenhongmin.will 2025-02-26 08:32:05 +0800
  • ef644a56e0 update ut chenhongmin.will 2025-02-26 08:13:56 +0800
  • 870418802a add fp8 ut chenhongmin.will 2025-02-25 23:29:18 +0800
  • b67980309b fix(benchmark): store 'compare' and 'one' perf results in csv files and visualize them yangsijia.614 2025-02-25 23:52:54 +0800
  • 4edea86f9e cuda12.8 recommendation ljss 2025-02-26 00:05:57 +0800
  • dfe8ffc75a enable fp8 api chenhongmin.will 2025-02-25 22:34:01 +0800
  • c50d29d170 fix compile chenhongmin.will 2025-02-25 21:30:39 +0800
  • 7409203f44 enable fp8 compile chenhongmin.will 2025-02-25 17:48:07 +0800
  • fed0499301 fp8 shared mem chenhongmin.will 2025-02-25 11:08:28 +0800
  • e09c057c61 Merge branch 'main' into packaging Xuehai Pan 2025-02-25 11:09:38 +0800
  • b67a18f850 update gmem chenhongmin.will 2025-02-25 09:40:56 +0800
  • b549289fb4 Merge pull request #32 from sijiac/fp16-support Jiashi Li 2025-02-25 09:19:42 +0800
  • e1e9fa98f8 Style fix ljss 2025-02-25 09:18:11 +0800
  • d833dbd711 enable fp8 chenhongmin.will 2025-02-25 09:03:02 +0800
  • 26d3077949 chore(setup): properly package the repository as a Python package Xuehai Pan 2025-02-24 18:18:38 +0800
  • a3b74b8574 add flag to disable FP16 compile Sijia Chen 2025-02-24 10:01:59 -0800
  • 0ad79739de add requirements.txt Yuan Chen 2025-02-24 11:49:56 -0500
  • 18e32770cc Merge pull request #35 from KnowingNothing/main Jiashi Li 2025-02-25 00:41:23 +0800
  • 7d69520ad4 Merge pull request #37 from chunyang-wen/Update-doc-string Jiashi Li 2025-02-25 00:38:31 +0800
  • 922f63bdaa add gitignore for png and csv files in benchmark zhengsize 2025-02-24 23:58:52 +0800
  • c4c5912b05 Update docstring chunyang.wen 2025-02-25 00:11:57 +0800
  • e0557deb3a Feature:Support flashMLA decoding via flashAttn2(#29) Changes: 1. Implement flashMLA with matrix absorption algorithm via flashAttn2 2. Add golden test on MXMACA platform Kevin Zhang 2025-02-24 23:55:21 +0800
  • 4da4dbd303 feat: add benchmark for flash_infer vs flash_mla zhengsize 2025-02-24 22:34:22 +0800
  • dae0690055 init fp8 chenhongmin.will 2025-02-24 21:12:36 +0800
  • 65fb7732fc support fp16 Sijia Chen 2025-02-24 01:58:53 -0800
  • 15a82b81b8 replace c10 optional with std optional Sijia Chen 2025-02-24 00:25:25 -0800
  • bcb90f2afd Merge pull request #9 from homorunner/main Jiashi Li 2025-02-24 13:21:58 +0800
  • dd1161e396 Merge pull request #14 from lancerts/minor-fix Jiashi Li 2025-02-24 13:13:58 +0800
  • 4fbaa9527c minor fix test lancerts 2025-02-23 20:12:49 -0800
  • 33e110bb66 implement the index Gareth Jones 2025-02-23 20:08:19 -0800
  • accc1695ee Merge pull request #12 from sazczmh/main Jiashi Li 2025-02-24 11:57:41 +0800
  • e62bdb4d3f support Windows build 程元 2025-02-24 11:29:36 +0800
  • 051e40e82b tests: Triton had remove the fast_flush parameter from do_bench (#4485) sazc 2025-02-24 10:59:22 +0800
  • 46bafd9e03 Cache output stride parameters in registers to reduce global loads Gareth Jones 2025-02-23 18:45:40 -0800
  • ccb208bcac Cache output stride parameters in registers to reduce global loads Gareth Jones 2025-02-23 18:44:25 -0800
  • 5fb94d668f Stage accumulator fragment to shared memory using tiled copy Gareth Jones 2025-02-23 18:38:15 -0800
  • 9f361aa02e Stage accumulator fragment to shared memory using tiled copy Gareth Jones 2025-02-23 18:23:07 -0800
  • 414a2f3eed Initial commit Jiashi Li 2025-02-21 14:31:27 +0800