DeepGEMM

Mirror of https://github.com/deepseek-ai/DeepGEMM, synced 2025-06-26 23:15:49 +00:00

Author	SHA1	Message	Date
google-labs-jules[bot]	93ea4797c0	Add initial support for Nvidia Blackwell (SM120) This change introduces the necessary compiler flags and CMake configurations to enable support for the Nvidia Blackwell SM120 architecture. - Modified deep_gemm/jit/compiler.py to include sm_120 and compute_120 flags for NVCC and NVRTC. - Updated CMakeLists.txt to add the new architecture flags for the build process. Further testing on Blackwell hardware is required to validate MMA instruction compatibility and overall performance.	2025-06-24 00:30:35 +00:00
yukuai26	8dfa329827	Grouped GEMM skip useless computation for unaligned Ms (#103) * Grouped GEMM skip useless computation for unaligned Ms * Update readme.md * small typo * Rename variables * Restore previous indent * Format * Refactor tests * Add `SkipComputation` types * Bug fixed * Format * Fix tests * Add assertions * Minor fix --------- Co-authored-by: yukuai <yukuai@deepseek.com> Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2025-05-27 13:43:38 +08:00
Chenggang Zhao	391755ada0	Fix JIT tests	2025-05-16 14:39:58 +08:00
Chenggang Zhao	104a6ec109	Add `__assertfail`	2025-05-15 17:04:21 +08:00
Chenggang Zhao	3b412f458a	Unify `kwargs` usages	2025-05-15 16:53:52 +08:00
Chenggang Zhao	4373af2e82	Add `DG_PRINT_CONFIGS`	2025-05-15 16:36:40 +08:00
Chenggang Zhao	816b39053a	Refactor launch-related structures	2025-05-15 16:14:21 +08:00
Chenggang Zhao	8702f910e3	Fix 12.9 compatibility	2025-05-07 13:23:40 +08:00
Gabriel Wu	bfe983c4c2	Refactor JIT compilation (+NVRTC support) (#94) * [wip] refactor: compile to .cubin Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * refactor: compile to .cubin and add NVRTC option Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * fix: compiler version Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * feat: compat for old drivers Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * feat: save kernel name to file Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * feat: fix win compat Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * fix: windows compat Signed-off-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com> * feat: make API more general Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * feat: drop support for CUDA<12.3 Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * doc: update README Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * Some lints and refactor * Refactor runtime * Several fixes * Refactor environment variables * Code format * Add a TODO * Compatible with CUDA 12.3 * Fix indent * Fix typing * Drop support for Windows * Add a TODO --------- Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Signed-off-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com> Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2025-05-07 11:38:14 +08:00
Chenggang Zhao	37aa127451	Use swizzling instead of padding (#86) * Add swizzling params * Add TMA D descriptor * Always use STSMx2 * Swizzling draft * Compatible with padding * Fix bugs * Optimize swizzle performance * Optimize expression * Optimize TMA issues * Fix README * Stricter assertions	2025-04-14 15:20:58 +08:00
Chenggang Zhao	d14962f072	Add `DG_NVCC_OVERRIDE_CPP_STANDARD`	2025-04-03 15:53:29 +08:00
Chenggang Zhao	3a5539b7db	Use `c++20`	2025-04-03 15:47:59 +08:00
Chenggang Zhao	6db7e1863b	Solve STSM bank conflict via padding and 3D TMA	2025-04-03 15:39:35 +08:00
YLGH	b7db15ce94	Update nvcc flag c++20 Needed for fconcepts	2025-03-25 14:15:39 -07:00
Chenggang Zhao	7768319ffe	Remove unaligned predicates	2025-03-25 16:32:40 +08:00
Chenggang Zhao	7ffb118e54	Support multicasting on B	2025-03-25 14:56:42 +08:00
Chenggang Zhao	bd2a775528	Code format	2025-03-11 13:26:10 +08:00
Chenggang Zhao	5233bad1e9	Merge pull request #55 from sleepcoo/fix-cudagraph fix cuda_graph rng check error	2025-03-11 13:25:35 +08:00
sleepcoo	723a00338e	fix cuda_graph rng check error	2025-03-11 12:40:42 +08:00
sazc	fcd1dcd99d	Performance: reducing the percentage of FFMA interleaving yields a slight performance gain, roughly 0.5%	2025-03-05 17:50:22 +08:00
Chenggang Zhao	ca13ce0fab	Fix TMA store bugs and code format	2025-02-27 17:57:21 +08:00
Chenggang Zhao	6e55da296f	Fix `python -O` mode issues	2025-02-27 10:42:46 +08:00
Chenggang Zhao	a6d97a1c1b	Initial commit	2025-02-25 22:52:41 +08:00

23 Commits