Support TMA multicast on B with m_grouped_gemm_contiguous. (#88)

yukuai26
2025-04-21 09:43:17 +08:00
committed by GitHub
parent 83aa960b9b
commit 891f35adf5
5 changed files with 74 additions and 31 deletions

@@ -12,10 +12,11 @@ Despite its lightweight design, DeepGEMM's performance matches or exceeds expert
 ## Roadmap
-- [ ] More correctness tests for grouped-contiguous layout
+- [x] More correctness tests for grouped-contiguous layout
 - [x] Shared memory swizzling for output
 - [ ] Larger block size on N (up to 256)
-- [ ] MoE scheduler with TMA multicast compatibility
+- [x] MoE scheduler with TMA multicast compatibility
+- [ ] Fix TMA multicast compatibility for indivisible shapes
 - [ ] Weight gradient kernels for dense models
 - [ ] Weight gradient kernels for MoE models
 - [ ] Utility kernels for MoE models (as a pre-built CUDA library)