FlashMLA

mirror of https://github.com/deepseek-ai/FlashMLA synced 2025-06-26 18:15:54 +00:00

Files

IshanaSabrish 927eebc10f fix: Update named barrier thread count to match actual participating threads

- Changed kNThreads (256) to 128 in NamedBarrier::arrive calls to match the actual number of threads in a warp group
- Fixed a potential deadlock in which the barrier waited for more threads than would ever arrive
- Updated both the SReady and SoftmaxReady barrier synchronizations
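
The failure mode described in this commit message can be illustrated with a small host-side model. This is only a sketch under stated assumptions: `BarrierModel` and `waiters_released` are hypothetical names invented here, not part of CUTLASS or FlashMLA, and the model assumes a counting barrier that releases its waiters once the number of threads that have reached it (whether they only signal or also wait) equals the declared count. It runs no GPU code and does not determine which threads participate in the real kernel.

```cpp
// Host-side model of a counting barrier. Hypothetical type for illustration;
// it is not the CUTLASS NamedBarrier class and executes nothing on a GPU.
struct BarrierModel {
    int declared_count;  // thread count that the callers pass to the barrier
    int arrivals = 0;    // threads that have reached the barrier so far

    explicit BarrierModel(int count) : declared_count(count) {}

    // Record `threads` threads reaching the barrier (signalling or waiting).
    void reach(int threads) { arrivals += threads; }

    // Waiters are released only once enough threads have reached the barrier.
    bool released() const { return arrivals >= declared_count; }
};

// Returns true if the waiting threads would be released and false if they
// would hang, given how many threads only signal and how many wait.
bool waiters_released(int declared_count, int signalling_threads, int waiting_threads) {
    BarrierModel barrier(declared_count);
    barrier.reach(signalling_threads);  // threads that signal and continue
    barrier.reach(waiting_threads);     // threads that block at the barrier
    return barrier.released();
}
```

In this model, `waiters_released(256, 64, 64)` is false because only 128 threads ever reach a barrier that was told to expect 256, which is the kind of hang the commit message describes. `waiters_released(128, 64, 64)` and `waiters_released(256, 128, 128)` are both true, since the declared count matches the threads that actually show up.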

2025-03-01 21:18:05 +05:30

| Name | Last commit | Date |
| --- | --- | --- |
| cutlass @ afa1772203 | Initial commit | 2025-02-24 09:20:23 +08:00 |
| flash_api.cpp | add flag to disable FP16 compile | 2025-02-24 10:01:59 -08:00 |
| flash_fwd_mla_bf16_sm90.cu | Initial commit | 2025-02-24 09:20:23 +08:00 |
| flash_fwd_mla_fp16_sm90.cu | support fp16 | 2025-02-24 01:58:53 -08:00 |
| flash_fwd_mla_kernel.h | fix: Update named barrier thread count to match actual participating threads | 2025-03-01 21:18:05 +05:30 |
| flash_fwd_mla_metadata.cu | support fp16 | 2025-02-24 01:58:53 -08:00 |
| flash_mla.h | Initial commit | 2025-02-24 09:20:23 +08:00 |
| named_barrier.h | Initial commit | 2025-02-24 09:20:23 +08:00 |
| softmax.h | Initial commit | 2025-02-24 09:20:23 +08:00 |
| static_switch.h | Initial commit | 2025-02-24 09:20:23 +08:00 |
| utils.h | Initial commit | 2025-02-24 09:20:23 +08:00 |