Commit Graph

  • b34733898a Merge e29e996a42 into e82c4139da Wangzheee 2025-06-24 18:53:27 +0800
  • e29e996a42 update unitest wangzhe_ant 2025-06-24 18:53:19 +0800
  • 7db1b0ef63 update unitest wangzhe_ant 2025-06-24 18:24:08 +0800
  • ccd63bb234 fix tma_d_offset_desc_swapAB, update unitest wangzhe_ant 2025-06-24 17:52:28 +0800
  • cc17efd000 Merge 93ea4797c0 into e82c4139da celsowm 2025-06-23 21:31:03 -0300
  • 93ea4797c0 Add initial support for Nvidia Blackwell (SM120) google-labs-jules[bot] 2025-06-24 00:30:35 +0000
  • e82c4139da Revert "Fixed the bug in get_swizzle_mode function related to elem_size setting. (#115)" main yukuai 2025-06-23 17:13:36 +0800
  • ac428e25e0 Fixed the bug in get_swizzle_mode function related to elem_size setting. (#115) TherLF 2025-06-23 09:37:10 +0800
  • 26a603f518 fix some bug Wangzheee 2025-06-20 06:53:24 +0000
  • d29b20cd16 support group_gemm_offset, group_gemm_offset_swapAB Wangzheee 2025-06-19 14:51:38 +0000
  • 88a2b6c490 Fixed the bug in get_swizzle_mode function related to elem_size setting. Ther-LF 2025-06-17 08:49:36 +0000
  • 0c88cd0139 Fix illegal memory address when skipping -1 m indices (#113) shixianc 2025-06-15 19:44:31 -0700
  • b7ccb135bc Fix illegal memory address when skipping -1 m indices Shixian Cui 2025-06-15 01:26:58 +0000
  • dd4bf2e620 Merge cc416ee4fa into 8dfa329827 Ray Wang 2025-06-12 17:12:50 +0800
  • cc416ee4fa Update layout.py fzyzcjy 2025-06-12 16:10:00 +0800
  • a437e0b1ca Add more GPU architectures support Ray Wang 2025-06-11 18:56:20 -0700
  • 8dfa329827 Grouped GEMM skip useless computation for unaligned Ms (#103) yukuai26 2025-05-27 13:43:38 +0800
  • b3726b423c Minor fix Chenggang Zhao 2025-05-27 13:41:44 +0800
  • 3bd234e79c Add assertions Chenggang Zhao 2025-05-27 13:21:19 +0800
  • 780b4098e4 Fix tests Chenggang Zhao 2025-05-27 13:15:26 +0800
  • 81de208430 Format Chenggang Zhao 2025-05-27 12:00:10 +0800
  • 81f906ef76 Bug fixed Chenggang Zhao 2025-05-27 11:58:08 +0800
  • a5373e4bbd Add SkipComputation types Chenggang Zhao 2025-05-27 11:57:25 +0800
  • 1169f83c36 Refactor tests Chenggang Zhao 2025-05-27 11:32:05 +0800
  • c4e31d121b Format Chenggang Zhao 2025-05-27 11:14:13 +0800
  • 2c5ab83c6c Restore previous indent Chenggang Zhao 2025-05-27 11:11:46 +0800
  • e7e38ed222 Rename variables yukuai 2025-05-20 16:40:56 +0800
  • dc89674f47 small typo yukuai 2025-05-20 16:20:50 +0800
  • db1f0b5a1c Update readme.md yukuai 2025-05-20 15:59:18 +0800
  • ccca476ac4 Grouped GEMM skip useless computation for unaligned Ms yukuai 2025-05-20 15:31:30 +0800
  • 391755ada0 Fix JIT tests Chenggang Zhao 2025-05-16 14:39:58 +0800
  • 78d8362e7a Add a missing #pragma once Chenggang Zhao 2025-05-15 18:10:05 +0800
  • ec426b9d66 Merge pull request #100 from deepseek-ai/remove-tuner Chenggang Zhao 2025-05-15 17:05:42 +0800
  • 104a6ec109 Add __assertfail Chenggang Zhao 2025-05-15 17:04:21 +0800
  • 3b412f458a Unify kwargs usages Chenggang Zhao 2025-05-15 16:53:52 +0800
  • 350989eef3 Unify ceil_divs Chenggang Zhao 2025-05-15 16:48:32 +0800
  • 4373af2e82 Add DG_PRINT_CONFIGS Chenggang Zhao 2025-05-15 16:36:40 +0800
  • 816b39053a Refactor launch-related structures Chenggang Zhao 2025-05-15 16:14:21 +0800
  • 09a19dcddd fix cassanof 2025-05-14 21:38:03 -0700
  • e2d6a107ef Cleanup some useless staffs Chenggang Zhao 2025-05-14 15:46:45 +0800
  • ebf3d2f916 Update plans Chenggang Zhao 2025-05-14 15:05:24 +0800
  • 04278f6dee Weight gradient kernels for dense and MoE models (#95) Zhean Xu 2025-05-14 14:47:58 +0800
  • a6ced6f207 Add stride(0) assertions Zhean Xu 2025-05-14 14:41:39 +0800
  • 279eb03190 Remove restrictions on N Chenggang Zhao 2025-05-14 14:27:04 +0800
  • c4a7116e0a Several cleanups Chenggang Zhao 2025-05-14 14:18:43 +0800
  • 6233709c67 Update docs Zhean Xu 2025-05-13 15:30:32 +0800
  • 919f55be9c Support unaligned n,k and gmem stride Zhean Xu 2025-05-09 12:55:46 +0800
  • adf5de0244 Merge branch 'main' into wgrad-gemm Zhean Xu 2025-05-09 12:30:13 +0800
  • d75b218b7b Update README with NVRTC news Chenggang Zhao 2025-05-07 13:26:58 +0800
  • 8702f910e3 Fix 12.9 compatibility Chenggang Zhao 2025-05-07 13:23:40 +0800
  • 085b4a1532 Add DG_PRINT_AUTOTUNE to README Chenggang Zhao 2025-05-07 11:46:52 +0800
  • daec8fd2fc Fix pipeline stage edge cases Chenggang Zhao 2025-05-07 11:40:34 +0800
  • bfe983c4c2 Refactor JIT compilation (+NVRTC support) (#94) Gabriel Wu 2025-05-07 11:38:14 +0800
  • 3f22f81326 Add a TODO Chenggang Zhao 2025-05-07 11:30:57 +0800
  • 41b4cff7c8 Drop support for Windows Chenggang Zhao 2025-05-07 11:24:35 +0800
  • a29b331c48 Fix typing Chenggang Zhao 2025-05-07 11:22:02 +0800
  • c5fa9f1068 Fix indent Chenggang Zhao 2025-05-07 11:19:55 +0800
  • ba349d9cf8 Compatible with CUDA 12.3 Chenggang Zhao 2025-05-07 11:15:19 +0800
  • 5373da7b28 Add a TODO Chenggang Zhao 2025-05-07 10:15:27 +0800
  • 159ba93ab3 Code format Chenggang Zhao 2025-05-07 10:13:19 +0800
  • 5272d40aaf Refactor environment variables Chenggang Zhao 2025-05-07 10:09:01 +0800
  • 83f6e9537e Several fixes Chenggang Zhao 2025-05-07 09:57:39 +0800
  • 317e83581d Refactor runtime Chenggang Zhao 2025-05-06 17:45:42 +0800
  • 981cc58932 Some lints and refactor Chenggang Zhao 2025-05-06 17:23:35 +0800
  • d5470d3b4e Init weight gradient kernels. Zhean Xu 2025-05-06 17:16:27 +0800
  • 8aff6309d4 Merge remote-tracking branch 'upstream/main' into nvrtc Zihua Wu 2025-04-29 05:28:07 -0700
  • d374456787 Less stages for small shape K Chenggang Zhao 2025-04-28 10:36:08 +0800
  • 86afd0c212 Add two more optimization TODOs Chenggang Zhao 2025-04-27 17:51:11 +0800
  • 69852c465f doc: update README Zihua Wu 2025-04-25 18:57:22 -0700
  • d473f594be Merge remote-tracking branch 'upstream/main' into nvrtc Zihua Wu 2025-04-25 18:56:49 -0700
  • f6198492cb feat: drop support for CUDA<12.3 Zihua Wu 2025-04-25 18:56:40 -0700
  • 33e0c3ce40 Update plans Chenggang Zhao 2025-04-24 14:37:53 +0800
  • 46762b6903 feat: make API more general Zihua Wu 2025-04-23 02:34:23 -0700
  • 6c982791eb Merge remote-tracking branch 'upstream/main' into nvrtc Zihua Wu 2025-04-23 00:17:28 -0700
  • 6f0a17cb10 Merge pull request #1 from lucifer1004/nvrtc-compat Gabriel Wu 2025-04-23 15:06:10 +0800
  • 95e81b3dd6 Indivisible TMA (#90) yukuai26 2025-04-23 14:55:14 +0800
  • 2d8c4f22d5 fix: windows compat Gabriel Wu 2025-04-23 14:47:15 +0800
  • 6b53c65c04 Optimize performance Chenggang Zhao 2025-04-23 14:34:14 +0800
  • 24517316af Minor update Chenggang Zhao 2025-04-23 14:14:49 +0800
  • d7c068d467 Fix unaligned cases Chenggang Zhao 2025-04-23 14:09:21 +0800
  • 40c09fb883 feat: fix win compat Zihua Wu 2025-04-22 23:01:23 -0700
  • f4b205bfa3 Minor fixes Chenggang Zhao 2025-04-23 13:32:19 +0800
  • a3210ac850 feat: save kernel name to file Zihua Wu 2025-04-22 22:23:47 -0700
  • 767793bf95 feat: compat for old drivers Zihua Wu 2025-04-22 20:42:59 -0700
  • 55d1d01c43 Minor fixes Chenggang Zhao 2025-04-23 11:38:45 +0800
  • 78c7fa347e fix: compiler version Zihua Wu 2025-04-23 00:06:18 +0000
  • c14cad0c06 refactor: compile to .cubin and add NVRTC option Zihua Wu 2025-04-22 10:17:52 +0000
  • 07ef809d82 Optimize performance Chenggang Zhao 2025-04-22 17:48:11 +0800
  • 59884211ea Simplify Chenggang Zhao 2025-04-22 17:37:34 +0800
  • f4014953ad Several code lints x2 Chenggang Zhao 2025-04-22 17:24:02 +0800
  • 902208a17e Several code lints Chenggang Zhao 2025-04-22 16:58:34 +0800
  • 27cd276e19 [wip] refactor: compile to .cubin Zihua Wu 2025-04-22 08:08:40 +0000
  • 80f1cfc630 add notes2 yukuai 2025-04-22 15:56:40 +0800
  • 8a92750027 fix typo yukuai 2025-04-22 15:51:47 +0800
  • bfb2bcc04d add notes yukuai 2025-04-22 15:23:32 +0800
  • d2369e8f30 fix typo2 yukuai 2025-04-22 15:19:29 +0800
  • 69477036c0 fix typo yukuai 2025-04-22 15:03:39 +0800
  • ee4204ad98 tma support indivisible num_n_blocks/num_m_blocks yukuai 2025-04-22 14:35:31 +0800
  • 891f35adf5 Support TMA multicast on B with m_grouped_gemm_contiguous. (#88) yukuai26 2025-04-21 09:43:17 +0800
  • c99756778d Code polishing x3 Chenggang Zhao 2025-04-21 09:41:13 +0800