Mirror of https://github.com/deepseek-ai/DeepGEMM (synced 2025-06-26 23:15:49 +00:00)
Support TMA multicast on B with m_grouped_gemm_contiguous. (#88)
@@ -12,10 +12,11 @@ Despite its lightweight design, DeepGEMM's performance matches or exceeds expert
## Roadmap
-- [ ] More correctness tests for grouped-contiguous layout
+- [x] More correctness tests for grouped-contiguous layout
- [x] Shared memory swizzling for output
- [ ] Larger block size on N (up to 256)
-- [ ] MoE scheduler with TMA multicast compatibility
+- [x] MoE scheduler with TMA multicast compatibility
- [ ] Fix TMA multicast compatibility for indivisible shapes
- [ ] Weight gradient kernels for dense models
- [ ] Weight gradient kernels for MoE models
- [ ] Utility kernels for MoE models (as a pre-built CUDA library)