google-labs-jules[bot]
93ea4797c0
Add initial support for Nvidia Blackwell (SM120)
...
This change introduces the necessary compiler flags and CMake configurations to enable support for the Nvidia Blackwell SM120 architecture.
- Modified deep_gemm/jit/compiler.py to include sm_120 and compute_120 flags for NVCC and NVRTC.
- Updated CMakeLists.txt to add the new architecture flags for the build process.
Further testing on Blackwell hardware is required to validate MMA instruction compatibility and overall performance.
2025-06-24 00:30:35 +00:00
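For illustration only, here is a minimal sketch of how architecture flags of the kind named in the commit above (`sm_120`, `compute_120`) might be assembled for NVCC and NVRTC. The helper `arch_flags` and its parameters are hypothetical and are not taken from `deep_gemm/jit/compiler.py`. Only the flag spellings (`-gencode=arch=...,code=...` for NVCC and `--gpu-architecture=...` for NVRTC) are standard compiler options.

```python
from typing import List


def arch_flags(major: int, minor: int, use_nvrtc: bool = False) -> List[str]:
    """Build architecture flags for a compute capability such as (12, 0).

    Hypothetical helper for illustration; not the repository's actual code.
    """
    arch = f"{major}{minor}"
    if use_nvrtc:
        # NVRTC takes a single target through --gpu-architecture.
        return [f"--gpu-architecture=sm_{arch}"]
    # NVCC: compile via the virtual architecture (compute_XY) and
    # embed machine code for the real architecture (sm_XY).
    return [f"-gencode=arch=compute_{arch},code=sm_{arch}"]


if __name__ == "__main__":
    print(arch_flags(12, 0))                  # NVCC flags for SM120
    print(arch_flags(12, 0, use_nvrtc=True))  # NVRTC flags for SM120
```

As the commit message notes, emitting these flags only selects the target; whether the kernels' MMA instructions are valid on that target still has to be checked on real hardware.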
yukuai26
8dfa329827
Grouped GEMM skip useless computation for unaligned Ms (#103)
...
* Grouped GEMM skip useless computation for unaligned Ms
* Update readme.md
* small typo
* Rename variables
* Restore previous indent
* Format
* Refactor tests
* Add `SkipComputation` types
* Bug fixed
* Format
* Fix tests
* Add assertions
* Minor fix
---------
Co-authored-by: yukuai <yukuai@deepseek.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
2025-05-27 13:43:38 +08:00
Chenggang Zhao
391755ada0
Fix JIT tests
2025-05-16 14:39:58 +08:00
Chenggang Zhao
104a6ec109
Add __assertfail
2025-05-15 17:04:21 +08:00
Chenggang Zhao
3b412f458a
Unify kwargs usages
2025-05-15 16:53:52 +08:00
Chenggang Zhao
4373af2e82
Add DG_PRINT_CONFIGS
2025-05-15 16:36:40 +08:00
Chenggang Zhao
816b39053a
Refactor launch-related structures
2025-05-15 16:14:21 +08:00
Chenggang Zhao
8702f910e3
Fix 12.9 compatibility
2025-05-07 13:23:40 +08:00
Gabriel Wu
bfe983c4c2
Refactor JIT compilation (+NVRTC support) (#94)
...
* [wip] refactor: compile to .cubin
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
* refactor: compile to .cubin and add NVRTC option
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
* fix: compiler version
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
* feat: compat for old drivers
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
* feat: save kernel name to file
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
* feat: fix win compat
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
* fix: windows compat
Signed-off-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com>
* feat: make API more general
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
* feat: drop support for CUDA<12.3
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
* doc: update README
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
* Some lints and refactor
* Refactor runtime
* Several fixes
* Refactor environment variables
* Code format
* Add a TODO
* Compatible with CUDA 12.3
* Fix indent
* Fix typing
* Drop support for Windows
* Add a TODO
---------
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Signed-off-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
2025-05-07 11:38:14 +08:00
Chenggang Zhao
37aa127451
Use swizzling instead of padding (#86)
...
* Add swizzling params
* Add TMA D descriptor
* Always use STSMx2
* Swizzling draft
* Compatible with padding
* Fix bugs
* Optimize swizzle performance
* Optimize expression
* Optimize TMA issues
* Fix README
* Stricter assertions
2025-04-14 15:20:58 +08:00
Chenggang Zhao
d14962f072
Add DG_NVCC_OVERRIDE_CPP_STANDARD
2025-04-03 15:53:29 +08:00
Chenggang Zhao
3a5539b7db
Use c++20
2025-04-03 15:47:59 +08:00
Chenggang Zhao
6db7e1863b
Solve STSM bank conflict via padding and 3D TMA
2025-04-03 15:39:35 +08:00
YLGH
b7db15ce94
Update nvcc flag c++20
...
Needed for fconcepts
2025-03-25 14:15:39 -07:00
Chenggang Zhao
7768319ffe
Remove unaligned predicates
2025-03-25 16:32:40 +08:00
Chenggang Zhao
7ffb118e54
Support multicasting on B
2025-03-25 14:56:42 +08:00
Chenggang Zhao
bd2a775528
Code format
2025-03-11 13:26:10 +08:00
Chenggang Zhao
5233bad1e9
Merge pull request #55 from sleepcoo/fix-cudagraph
...
fix cuda_graph rng check error
2025-03-11 13:25:35 +08:00
sleepcoo
723a00338e
fix cuda_graph rng check error
2025-03-11 12:40:42 +08:00
sazc
fcd1dcd99d
Performance: reducing the percentage of FFMA interleaving yields a slight performance gain, roughly 0.5%
2025-03-05 17:50:22 +08:00
Chenggang Zhao
ca13ce0fab
Fix TMA store bugs and code format
2025-02-27 17:57:21 +08:00
Chenggang Zhao
6e55da296f
Fix python -O mode issues
2025-02-27 10:42:46 +08:00
Chenggang Zhao
a6d97a1c1b
Initial commit
2025-02-25 22:52:41 +08:00