Support TMA multicast on B with m_grouped_gemm_contiguous. (#88)

2025-06-26 23:15:49 +00:00 · 2025-04-21 09:43:17 +08:00
parent 83aa960b9b
commit 891f35adf5
5 changed files with 74 additions and 31 deletions
--- a/README.md
+++ b/README.md
@@ -12,10 +12,11 @@ Despite its lightweight design, DeepGEMM's performance matches or exceeds expert

 ## Roadmap

-- [ ] More correctness tests for grouped-contiguous layout
+- [x] More correctness tests for grouped-contiguous layout
 - [x] Shared memory swizzling for output
 - [ ] Larger block size on N (up to 256)
-- [ ] MoE scheduler with TMA multicast compatibility
+- [x] MoE scheduler with TMA multicast compatibility
+- [ ] Fix TMA multicast compatibility for indivisible shapes
 - [ ] Weight gradient kernels for dense models
 - [ ] Weight gradient kernels for MoE models
 - [ ] Utility kernels for MoE models (as a pre-built CUDA library)