Commit Graph

  • 5acb5a9ec5 Merge 8aff6309d4 into d374456787 Gabriel Wu 2025-04-29 20:28:42 +0800
  • 8aff6309d4 Merge remote-tracking branch 'upstream/main' into nvrtc Zihua Wu 2025-04-29 05:28:07 -0700
  • d374456787 Less stages for small shape K main Chenggang Zhao 2025-04-28 10:36:08 +0800
  • 46c7a1ef36 Simplify remove-unaligned-pipe Chenggang Zhao 2025-04-28 10:12:35 +0800
  • 1e23215eb6 Finish a draft version Chenggang Zhao 2025-04-28 09:49:45 +0800
  • 86afd0c212 Add two more optimization TODOs Chenggang Zhao 2025-04-27 17:51:11 +0800
  • 69852c465f doc: update README Zihua Wu 2025-04-25 18:57:22 -0700
  • d473f594be Merge remote-tracking branch 'upstream/main' into nvrtc Zihua Wu 2025-04-25 18:56:49 -0700
  • f6198492cb feat: drop support for CUDA<12.3 Zihua Wu 2025-04-25 18:56:40 -0700
  • 33e0c3ce40 Update plans Chenggang Zhao 2025-04-24 14:37:53 +0800
  • 46762b6903 feat: make API more general Zihua Wu 2025-04-23 02:34:23 -0700
  • 6c982791eb Merge remote-tracking branch 'upstream/main' into nvrtc Zihua Wu 2025-04-23 00:17:28 -0700
  • 6f0a17cb10 Merge pull request #1 from lucifer1004/nvrtc-compat Gabriel Wu 2025-04-23 15:06:10 +0800
  • 95e81b3dd6 Indivisible TMA (#90) yukuai26 2025-04-23 14:55:14 +0800
  • 2d8c4f22d5 fix: windows compat Gabriel Wu 2025-04-23 14:47:15 +0800
  • 6b53c65c04 Optimize performance Chenggang Zhao 2025-04-23 14:34:14 +0800
  • 24517316af Minor update Chenggang Zhao 2025-04-23 14:14:49 +0800
  • d7c068d467 Fix unaligned cases Chenggang Zhao 2025-04-23 14:09:21 +0800
  • 40c09fb883 feat: fix win compat Zihua Wu 2025-04-22 23:01:23 -0700
  • f4b205bfa3 Minor fixes Chenggang Zhao 2025-04-23 13:32:19 +0800
  • a3210ac850 feat: save kernel name to file Zihua Wu 2025-04-22 22:23:47 -0700
  • 767793bf95 feat: compat for old drivers Zihua Wu 2025-04-22 20:42:59 -0700
  • 55d1d01c43 Minor fixes Chenggang Zhao 2025-04-23 11:38:45 +0800
  • 78c7fa347e fix: compiler version Zihua Wu 2025-04-23 00:06:18 +0000
  • c14cad0c06 refactor: compile to .cubin and add NVRTC option Zihua Wu 2025-04-22 10:17:52 +0000
  • 07ef809d82 Optimize performance Chenggang Zhao 2025-04-22 17:48:11 +0800
  • 59884211ea Simplify Chenggang Zhao 2025-04-22 17:37:34 +0800
  • f4014953ad Several code lints x2 Chenggang Zhao 2025-04-22 17:24:02 +0800
  • 902208a17e Several code lints Chenggang Zhao 2025-04-22 16:58:34 +0800
  • 27cd276e19 [wip] refactor: compile to .cubin Zihua Wu 2025-04-22 08:08:40 +0000
  • 80f1cfc630 add notes2 yukuai 2025-04-22 15:56:40 +0800
  • 8a92750027 fix typo yukuai 2025-04-22 15:51:47 +0800
  • bfb2bcc04d add notes yukuai 2025-04-22 15:23:32 +0800
  • d2369e8f30 fix typo2 yukuai 2025-04-22 15:19:29 +0800
  • 69477036c0 fix typo yukuai 2025-04-22 15:03:39 +0800
  • ee4204ad98 tma support indivisible num_n_blocks/num_m_blocks yukuai 2025-04-22 14:35:31 +0800
  • 891f35adf5 Support TMA multicast on B with m_grouped_gemm_contiguous. (#88) yukuai26 2025-04-21 09:43:17 +0800
  • c99756778d Code polishing x3 Chenggang Zhao 2025-04-21 09:41:13 +0800
  • 594020e6af Code polishing x2 Chenggang Zhao 2025-04-18 16:43:14 +0800
  • a51629ddf9 Code polishing Chenggang Zhao 2025-04-18 16:43:00 +0800
  • 2752a67aad Support TMA multicast on B with m_grouped_gemm_contiguous. yukuai 2025-04-18 15:38:04 +0800
  • 83aa960b9b Fix bugs Chenggang Zhao 2025-04-18 11:55:51 +0800
  • fea9309c1e Update README Chenggang Zhao 2025-04-18 11:38:52 +0800
  • 340d9880f4 Overlap TMA store Chenggang Zhao 2025-04-18 11:18:23 +0800
  • 4499c4ccbb Refactor MMA template with CUTLASS (#87) Zhean Xu 2025-04-14 17:06:49 +0800
  • 857d57d157 Update README.md Zhean Xu 2025-04-14 17:03:35 +0800
  • 584b67eebb Refactor MMA with cutlass Zhean Xu 2025-04-14 16:56:13 +0800
  • 37aa127451 Use swizzling instead of padding (#86) Chenggang Zhao 2025-04-14 15:20:58 +0800
  • 406c630709 Stricter assertions Chenggang Zhao 2025-04-14 11:55:37 +0800
  • b699750a4a Fix README Chenggang Zhao 2025-04-14 11:33:05 +0800
  • f8797d3c12 Optimize TMA issues Chenggang Zhao 2025-04-14 11:29:28 +0800
  • 6366f5ad1a Optimize expression Chenggang Zhao 2025-04-14 10:26:33 +0800
  • 9406e2a3a1 Optimize swizzle performance Chenggang Zhao 2025-04-14 10:10:46 +0800
  • 5ff0eb24b5 Fix bugs Chenggang Zhao 2025-04-11 18:27:56 +0800
  • 23d4289365 Compatible with padding Chenggang Zhao 2025-04-11 17:38:22 +0800
  • 4c111418a2 Swizzling draft Chenggang Zhao 2025-04-11 17:19:59 +0800
  • 76804c096d Always use STSMx2 Chenggang Zhao 2025-04-11 14:37:37 +0800
  • 93c92c2c89 Add TMA D descriptor Chenggang Zhao 2025-04-11 14:10:11 +0800
  • 6078b25424 Add swizzling params Chenggang Zhao 2025-04-11 13:15:52 +0800
  • 2e7e58011b Merge pull request #83 from deepseek-ai/tma-1d-store Chenggang Zhao 2025-04-11 11:25:01 +0800
  • b0d64817a7 OOB bugs fixed Chenggang Zhao 2025-04-11 11:00:47 +0800
  • 99eb6ec563 Remove useless STSM Chenggang Zhao 2025-04-11 10:45:36 +0800
  • 8041ed7164 Use 1D TMA store Chenggang Zhao 2025-04-11 10:42:01 +0800
  • a77009cb14 Make partition pipelined Chenggang Zhao 2025-04-10 18:07:25 +0800
  • 5bda27244b Add CMake support for CLion indexing Chenggang Zhao 2025-04-10 09:52:15 +0800
  • 327ec92f69 Update roadmap Chenggang Zhao 2025-04-09 11:44:30 +0800
  • 677143be64 Update roadmap Chenggang Zhao 2025-04-09 11:41:36 +0800
  • fed3e4d701 Merge pull request #81 from deepseek-ai/blocktile-256x128 Chenggang Zhao 2025-04-09 11:26:40 +0800
  • 989c9e3694 Update README Chenggang Zhao 2025-04-09 11:17:47 +0800
  • a9967bc27c Update README Chenggang Zhao 2025-04-09 11:14:45 +0800
  • 5a80e4bb96 Fix indent x2 Chenggang Zhao 2025-04-09 11:00:10 +0800
  • bdca8b0624 Fix indent Chenggang Zhao 2025-04-09 10:59:07 +0800
  • 4c0cc290c7 Refactor M repetition with loops Chenggang Zhao 2025-04-09 10:50:44 +0800
  • a6524d411a Larger block N candidates Chenggang Zhao 2025-04-09 10:11:43 +0800
  • 48a5f071be Clean up config heuristics Chenggang Zhao 2025-04-09 10:01:15 +0800
  • ce65d5e33c Remove unused x256 WGMMA Chenggang Zhao 2025-04-09 09:32:46 +0800
  • 97575bf1c6 Performance: BlockTile 256x128 optimizations enable 1500+ TFLOPS FP8 performance on the H800-SXM platform sazc 2025-04-08 17:42:23 +0800
  • b4ecf9c3ff Fix TMA multicast bugs Chenggang Zhao 2025-04-07 14:34:42 +0800
  • bff5724ded Code format Chenggang Zhao 2025-04-07 09:32:43 +0800
  • 3ea3cb203c Merge pull request #80 from abcdabcd987/fix-link-error Chenggang Zhao 2025-04-07 09:31:58 +0800
  • b0868c9014 Merge pull request #79 from yizhang2077/lru-cache-opt Chenggang Zhao 2025-04-07 09:31:30 +0800
  • 611e3f659d Fix linking error from ODR violation Lequn Chen 2025-04-05 17:35:04 +0000
  • 776bd0cccc add lru-cache to avoid repeated calculation Yi Zhang 2025-04-04 12:44:26 +0800
  • c187c23ba8 Merge pull request #78 from deepseek-ai/tma-3d-padding Chenggang Zhao 2025-04-03 16:06:10 +0800
  • d14962f072 Add DG_NVCC_OVERRIDE_CPP_STANDARD Chenggang Zhao 2025-04-03 15:53:29 +0800
  • 3a5539b7db Use c++20 Chenggang Zhao 2025-04-03 15:47:59 +0800
  • 6db7e1863b Solve STSM bank conflict via padding and 3D TMA Chenggang Zhao 2025-04-03 15:39:35 +0800
  • d7ce715118 Revert "Update nvcc flag c++20" Liang 2025-03-28 10:43:29 +0800
  • c57699ac93 Merge pull request #76 from YLGH/patch-1 Liang 2025-03-26 09:52:47 +0800
  • b7db15ce94 Update nvcc flag c++20 YLGH 2025-03-25 14:15:39 -0700
  • 8002b769c0 Update README Chenggang Zhao 2025-03-25 18:13:24 +0800
  • a5645d7afa Merge pull request #74 from deepseek-ai/larger-block Chenggang Zhao 2025-03-25 18:07:33 +0800
  • 55ab91f72f Update performance Chenggang Zhao 2025-03-25 18:06:47 +0800
  • 09d097f84d Add some notes Chenggang Zhao 2025-03-25 17:41:49 +0800
  • 25db8de345 Better performance Chenggang Zhao 2025-03-25 17:34:06 +0800
  • 1999d553e5 Lower TMA requirement Chenggang Zhao 2025-03-25 17:18:53 +0800
  • ddccb230ca Fix NVCC branch divergence Chenggang Zhao 2025-03-25 17:12:51 +0800
  • 9c4f6f53f5 Optimize compilation speed Chenggang Zhao 2025-03-25 16:51:21 +0800
  • 612dd57001 Simplify code Chenggang Zhao 2025-03-25 16:45:20 +0800
  • 046fab64b7 Fix grouped GEMM cases Chenggang Zhao 2025-03-25 16:41:44 +0800