DeepGEMM

mirror of https://github.com/deepseek-ai/DeepGEMM synced 2025-05-31 18:48:16 +00:00

Author	SHA1	Message	Date
Chenggang Zhao	8702f910e3	Fix 12.9 compatibility	2025-05-07 13:23:40 +08:00
Gabriel Wu	bfe983c4c2	Refactor JIT compilation (+NVRTC support) (#94) * [wip] refactor: compile to .cubin * refactor: compile to .cubin and add NVRTC option * fix: compiler version * feat: compat for old drivers * feat: save kernel name to file * feat: fix win compat * fix: windows compat * feat: make API more general * feat: drop support for CUDA<12.3 * doc: update README * Some lints and refactor * Refactor runtime * Several fixes * Refactor environment variables * Code format * Add a TODO * Compatible with CUDA 12.3 * Fix indent * Fix typing * Drop support for Windows * Add a TODO --------- Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Signed-off-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com> Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>	2025-05-07 11:38:14 +08:00
Chenggang Zhao	37aa127451	Use swizzling instead of padding (#86) * Add swizzling params * Add TMA D descriptor * Always use STSMx2 * Swizzling draft * Compatible with padding * Fix bugs * Optimize swizzle performance * Optimize expression * Optimize TMA issues * Fix README * Stricter assertions	2025-04-14 15:20:58 +08:00
Chenggang Zhao	d14962f072	Add `DG_NVCC_OVERRIDE_CPP_STANDARD`	2025-04-03 15:53:29 +08:00
Chenggang Zhao	3a5539b7db	Use `c++20`	2025-04-03 15:47:59 +08:00
Chenggang Zhao	6db7e1863b	Solve STSM bank conflict via padding and 3D TMA	2025-04-03 15:39:35 +08:00
YLGH	b7db15ce94	Update nvcc flag c++20 Needed for fconcepts	2025-03-25 14:15:39 -07:00
Chenggang Zhao	7768319ffe	Remove unaligned predicates	2025-03-25 16:32:40 +08:00
Chenggang Zhao	7ffb118e54	Support multicasting on B	2025-03-25 14:56:42 +08:00
Chenggang Zhao	bd2a775528	Code format	2025-03-11 13:26:10 +08:00
Chenggang Zhao	5233bad1e9	Merge pull request #55 from sleepcoo/fix-cudagraph: fix cuda_graph rng check error	2025-03-11 13:25:35 +08:00
sleepcoo	723a00338e	fix cuda_graph rng check error	2025-03-11 12:40:42 +08:00
sazc	fcd1dcd99d	Performance: reducing the percentage of FFMA interleaving yields a slight performance gain, roughly 0.5%	2025-03-05 17:50:22 +08:00
Chenggang Zhao	ca13ce0fab	Fix TMA store bugs and code format	2025-02-27 17:57:21 +08:00
Chenggang Zhao	6e55da296f	Fix `python -O` mode issues	2025-02-27 10:42:46 +08:00
Chenggang Zhao	a6d97a1c1b	Initial commit	2025-02-25 22:52:41 +08:00

16 Commits