IshanaSabrish
927eebc10f
fix: Update named barrier thread count to match actual participating threads
...
- Changed the thread count passed to NamedBarrier::arrive from kNThreads (256) to 128, matching the actual number of threads in the warp group
- Fixed a potential deadlock in which the barrier waited for more threads than would ever arrive
- Updated both the SReady and SoftmaxReady barrier synchronizations
2025-03-01 21:18:05 +05:30
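The deadlock described in the commit above comes from a barrier configured for more participants than actually reach it. The sketch below is only a CPU-side analogy using Python's `threading.Barrier`, not the CUDA code from the repository. The function name `run_barrier` and the scaled-down counts (8 expected versus 4 arriving, standing in for 256 versus 128) are assumptions for illustration. A GPU named barrier has no timeout, so the mismatched case would hang there rather than report failure.

```python
import threading


def run_barrier(expected_parties, arriving_threads, timeout=0.5):
    """Return True if every arriving thread passes the barrier, False if it stalls.

    expected_parties plays the role of the thread count given to the barrier;
    arriving_threads is how many threads actually reach it.
    """
    barrier = threading.Barrier(expected_parties)
    results = []
    lock = threading.Lock()

    def worker():
        try:
            # Blocks until expected_parties threads have called wait().
            barrier.wait(timeout=timeout)
            passed = True
        except threading.BrokenBarrierError:
            # Raised when the wait times out or another waiter broke the barrier.
            passed = False
        with lock:
            results.append(passed)

    threads = [threading.Thread(target=worker) for _ in range(arriving_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(results)


if __name__ == "__main__":
    print("count matches arrivals:", run_barrier(4, 4))
    print("count exceeds arrivals:", run_barrier(8, 4))
```

When the configured count equals the number of arriving threads, everyone is released. When it is larger, no thread can ever complete the barrier, which is the situation the commit fixes by passing 128 instead of 256.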
Jiashi Li
480405ada9
fix readme
2025-02-26 20:32:39 +08:00
Jiashi Li
966eedc2f7
Fix readme
2025-02-26 20:30:45 +08:00
Jiashi Li
01d6d40062
Merge pull request #45 from yangsijia-serena/main
...
fix(benchmark): store 'compare' and 'one' perf results in csv files and visualize them
2025-02-26 20:14:40 +08:00
hpp
6492cabb28
add Community Support of [MetaX] and [Moore Threads]
2025-02-26 11:26:42 +08:00
yangsijia.614
b67980309b
fix(benchmark): store 'compare' and 'one' perf results in csv files and visualize them
2025-02-26 00:14:51 +08:00
ljss
4edea86f9e
cuda12.8 recommendation
2025-02-26 00:05:57 +08:00
Jiashi Li
b549289fb4
Merge pull request #32 from sijiac/fp16-support
...
Support FP16 dtype in FlashMLA kernel
2025-02-25 09:19:42 +08:00
ljss
e1e9fa98f8
Style fix
2025-02-25 09:18:11 +08:00
Sijia Chen
a3b74b8574
add flag to disable FP16 compile
2025-02-24 10:01:59 -08:00
Jiashi Li
18e32770cc
Merge pull request #35 from KnowingNothing/main
...
feat: add benchmark for flash_infer vs flash_mla
2025-02-25 00:41:23 +08:00
Jiashi Li
7d69520ad4
Merge pull request #37 from chunyang-wen/Update-doc-string
...
Update docstring
2025-02-25 00:38:31 +08:00
zhengsize
922f63bdaa
add gitignore for png and csv files in benchmark
2025-02-25 00:38:02 +08:00
chunyang.wen
c4c5912b05
Update docstring
2025-02-25 00:11:57 +08:00
zhengsize
4da4dbd303
feat: add benchmark for flash_infer vs flash_mla
2025-02-24 22:34:22 +08:00
Sijia Chen
65fb7732fc
support fp16
2025-02-24 01:58:53 -08:00
Sijia Chen
15a82b81b8
replace c10 optional with std optional
2025-02-24 00:25:40 -08:00
Jiashi Li
bcb90f2afd
Merge pull request #9 from homorunner/main
...
support Windows build
2025-02-24 13:21:58 +08:00
Jiashi Li
dd1161e396
Merge pull request #14 from lancerts/minor-fix
...
minor fix test
2025-02-24 13:13:58 +08:00
lancerts
4fbaa9527c
minor fix test
2025-02-23 20:12:49 -08:00
Jiashi Li
accc1695ee
Merge pull request #12 from sazczmh/main
...
tests: Triton 3.2.0 removed the fast_flush parameter from do_bench
2025-02-24 11:57:41 +08:00
程元
e62bdb4d3f
support Windows build
2025-02-24 11:29:36 +08:00
sazc
051e40e82b
tests: Triton removed the fast_flush parameter from do_bench (#4485)
2025-02-24 10:59:22 +08:00
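One way to stay compatible with both sides of a parameter removal like the one noted above is to pass a keyword argument only when the target callable still accepts it. The sketch below assumes nothing about Triton beyond what the commit message states. `old_do_bench` and `new_do_bench` are hypothetical stand-ins with made-up signatures, not the real `do_bench`, and `call_do_bench` is an illustrative helper. The commit itself may simply have deleted the argument from the test.

```python
import inspect


def call_do_bench(do_bench, fn, **kwargs):
    """Call do_bench(fn, ...), dropping keyword arguments it does not accept."""
    params = inspect.signature(do_bench).parameters
    accepts_any_kwarg = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not accepts_any_kwarg:
        # Keep only the keywords that appear in the callable's signature.
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return do_bench(fn, **kwargs)


# Hypothetical stand-in for a benchmark helper that still takes fast_flush.
def old_do_bench(fn, warmup=25, rep=100, fast_flush=True):
    fn()
    return ("old", fast_flush)


# Hypothetical stand-in for a benchmark helper after fast_flush was removed.
def new_do_bench(fn, warmup=25, rep=100):
    fn()
    return ("new", None)


if __name__ == "__main__":
    print(call_do_bench(old_do_bench, lambda: None, fast_flush=False))
    print(call_do_bench(new_do_bench, lambda: None, fast_flush=False))
```

The helper trades a small amount of indirection for a test that runs unchanged against either version of the dependency.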
Jiashi Li
414a2f3eed
Initial commit
...
i
2025-02-24 09:20:23 +08:00