Commit Graph

  • e82c4139da Revert "Fixed the bug in get_swizzle_mode function related to elem_size setting. (#115)" main yukuai 2025-06-23 17:13:36 +08:00
  • ac428e25e0 Fixed the bug in get_swizzle_mode function related to elem_size setting. (#115) TherLF 2025-06-23 09:37:10 +08:00
  • 0c88cd0139 Fix illegal memory address when skipping -1 m indices (#113) shixianc 2025-06-15 19:44:31 -07:00
  • 8dfa329827 Grouped GEMM: skip useless computation for unaligned Ms (#103) yukuai26 2025-05-27 13:43:38 +08:00
  • 391755ada0 Fix JIT tests Chenggang Zhao 2025-05-16 14:39:58 +08:00
  • 78d8362e7a Add a missing #pragma once Chenggang Zhao 2025-05-15 18:10:05 +08:00
  • ec426b9d66 Merge pull request #100 from deepseek-ai/remove-tuner Chenggang Zhao 2025-05-15 17:05:42 +08:00
  • 104a6ec109 Add __assertfail Chenggang Zhao 2025-05-15 17:04:21 +08:00
  • 3b412f458a Unify kwargs usages Chenggang Zhao 2025-05-15 16:53:52 +08:00
  • 350989eef3 Unify ceil_divs Chenggang Zhao 2025-05-15 16:48:32 +08:00
  • 4373af2e82 Add DG_PRINT_CONFIGS Chenggang Zhao 2025-05-15 16:36:40 +08:00
  • 816b39053a Refactor launch-related structures Chenggang Zhao 2025-05-15 16:14:21 +08:00
  • e2d6a107ef Clean up some useless stuff Chenggang Zhao 2025-05-14 15:46:45 +08:00
  • ebf3d2f916 Update plans Chenggang Zhao 2025-05-14 15:05:24 +08:00
  • 04278f6dee Weight gradient kernels for dense and MoE models (#95) Zhean Xu 2025-05-14 14:47:58 +08:00
  • d75b218b7b Update README with NVRTC news Chenggang Zhao 2025-05-07 13:26:58 +08:00
  • 8702f910e3 Fix 12.9 compatibility Chenggang Zhao 2025-05-07 13:23:40 +08:00
  • 085b4a1532 Add DG_PRINT_AUTOTUNE to README Chenggang Zhao 2025-05-07 11:46:52 +08:00
  • daec8fd2fc Fix pipeline stage edge cases Chenggang Zhao 2025-05-07 11:40:34 +08:00
  • bfe983c4c2 Refactor JIT compilation (+NVRTC support) (#94) Gabriel Wu 2025-05-07 11:38:14 +08:00
  • d374456787 Less stages for small shape K Chenggang Zhao 2025-04-28 10:36:08 +08:00
  • 86afd0c212 Add two more optimization TODOs Chenggang Zhao 2025-04-27 17:51:11 +08:00
  • 33e0c3ce40 Update plans Chenggang Zhao 2025-04-24 14:37:53 +08:00
  • 95e81b3dd6 Indivisible TMA (#90) yukuai26 2025-04-23 14:55:14 +08:00
  • 891f35adf5 Support TMA multicast on B with m_grouped_gemm_contiguous. (#88) yukuai26 2025-04-21 09:43:17 +08:00
  • 83aa960b9b Fix bugs Chenggang Zhao 2025-04-18 11:55:51 +08:00
  • fea9309c1e Update README Chenggang Zhao 2025-04-18 11:38:52 +08:00
  • 340d9880f4 Overlap TMA store Chenggang Zhao 2025-04-18 11:18:23 +08:00
  • 4499c4ccbb Refactor MMA template with CUTLASS (#87) Zhean Xu 2025-04-14 17:06:49 +08:00
  • 37aa127451 Use swizzling instead of padding (#86) Chenggang Zhao 2025-04-14 15:20:58 +08:00
  • 2e7e58011b Merge pull request #83 from deepseek-ai/tma-1d-store Chenggang Zhao 2025-04-11 11:25:01 +08:00
  • b0d64817a7 OOB bugs fixed Chenggang Zhao 2025-04-11 11:00:47 +08:00
  • 99eb6ec563 Remove useless STSM Chenggang Zhao 2025-04-11 10:45:36 +08:00
  • 8041ed7164 Use 1D TMA store Chenggang Zhao 2025-04-11 10:42:01 +08:00
  • a77009cb14 Make partition pipelined Chenggang Zhao 2025-04-10 18:07:25 +08:00
  • 5bda27244b Add CMake support for CLion indexing Chenggang Zhao 2025-04-10 09:52:15 +08:00
  • 327ec92f69 Update roadmap Chenggang Zhao 2025-04-09 11:44:30 +08:00
  • 677143be64 Update roadmap Chenggang Zhao 2025-04-09 11:41:36 +08:00
  • fed3e4d701 Merge pull request #81 from deepseek-ai/blocktile-256x128 Chenggang Zhao 2025-04-09 11:26:40 +08:00
  • 989c9e3694 Update README Chenggang Zhao 2025-04-09 11:17:47 +08:00
  • a9967bc27c Update README Chenggang Zhao 2025-04-09 11:14:45 +08:00
  • 5a80e4bb96 Fix indent x2 Chenggang Zhao 2025-04-09 11:00:10 +08:00
  • bdca8b0624 Fix indent Chenggang Zhao 2025-04-09 10:59:07 +08:00
  • 4c0cc290c7 Refactor M repetition with loops Chenggang Zhao 2025-04-09 10:50:44 +08:00
  • a6524d411a Larger block N candidates Chenggang Zhao 2025-04-09 10:11:43 +08:00
  • 48a5f071be Clean up config heuristics Chenggang Zhao 2025-04-09 10:01:15 +08:00
  • ce65d5e33c Remove unused x256 WGMMA Chenggang Zhao 2025-04-09 09:32:46 +08:00
  • 97575bf1c6 Performance: BlockTile 256x128 optimizations enable 1500+ TFLOPS FP8 performance on the H800-SXM platform sazc 2025-04-08 17:42:23 +08:00
  • b4ecf9c3ff Fix TMA multicast bugs Chenggang Zhao 2025-04-07 14:34:42 +08:00
  • bff5724ded Code format Chenggang Zhao 2025-04-07 09:32:43 +08:00
  • 3ea3cb203c Merge pull request #80 from abcdabcd987/fix-link-error Chenggang Zhao 2025-04-07 09:31:58 +08:00
  • b0868c9014 Merge pull request #79 from yizhang2077/lru-cache-opt Chenggang Zhao 2025-04-07 09:31:30 +08:00
  • 611e3f659d Fix linking error from ODR violation Lequn Chen 2025-04-05 17:35:04 +00:00
  • 776bd0cccc add lru-cache to avoid repeated calculation Yi Zhang 2025-04-04 12:44:26 +08:00
  • c187c23ba8 Merge pull request #78 from deepseek-ai/tma-3d-padding Chenggang Zhao 2025-04-03 16:06:10 +08:00
  • d14962f072 Add DG_NVCC_OVERRIDE_CPP_STANDARD Chenggang Zhao 2025-04-03 15:53:29 +08:00
  • 3a5539b7db Use c++20 Chenggang Zhao 2025-04-03 15:47:59 +08:00
  • 6db7e1863b Solve STSM bank conflict via padding and 3D TMA Chenggang Zhao 2025-04-03 15:39:35 +08:00
  • c57699ac93 Merge pull request #76 from YLGH/patch-1 Liang 2025-03-26 09:52:47 +08:00
  • b7db15ce94 Update nvcc flag c++20 YLGH 2025-03-25 14:15:39 -07:00
  • 8002b769c0 Update README Chenggang Zhao 2025-03-25 18:13:24 +08:00
  • a5645d7afa Merge pull request #74 from deepseek-ai/larger-block Chenggang Zhao 2025-03-25 18:07:33 +08:00
  • 55ab91f72f Update performance Chenggang Zhao 2025-03-25 18:06:47 +08:00
  • 09d097f84d Add some notes Chenggang Zhao 2025-03-25 17:41:49 +08:00
  • 25db8de345 Better performance Chenggang Zhao 2025-03-25 17:34:06 +08:00
  • 1999d553e5 Lower TMA requirement Chenggang Zhao 2025-03-25 17:18:53 +08:00
  • ddccb230ca Fix NVCC branch divergence Chenggang Zhao 2025-03-25 17:12:51 +08:00
  • 9c4f6f53f5 Optimize compilation speed Chenggang Zhao 2025-03-25 16:51:21 +08:00
  • 612dd57001 Simplify code Chenggang Zhao 2025-03-25 16:45:20 +08:00
  • 046fab64b7 Fix grouped GEMM cases Chenggang Zhao 2025-03-25 16:41:44 +08:00
  • 7768319ffe Remove unaligned predicates Chenggang Zhao 2025-03-25 16:32:40 +08:00
  • 3497428a5e Minor fix Chenggang Zhao 2025-03-25 15:16:26 +08:00
  • 7ffb118e54 Support multicasting on B Chenggang Zhao 2025-03-25 14:56:42 +08:00
  • 742fb1c8a5 Compilation-time GCD Chenggang Zhao 2025-03-25 13:41:28 +08:00
  • b922e64cb2 Support block size 160 Chenggang Zhao 2025-03-25 13:37:59 +08:00
  • 46eb0d08fb Performance: Larger BlockTile optimizations enable 1470+ TFLOPS FP8 performance on the H800-SXM platform sazc 2025-03-25 10:44:57 +08:00
  • 3b3783d06c Merge pull request #68 from ademeure/flush_l2_pr Liang 2025-03-16 09:16:34 +08:00
  • 6cbff5778f Correctly flush L2: reconstructing the tensors on every iteration effectively put them in L2 and gave the GPU enough idle time to avoid thermal throttling in a potentially unrealistic way. ademeure 2025-03-15 20:46:24 +00:00
  • e1c070fbef Merge pull request #65 from Z-NAVY/main Liang 2025-03-14 13:50:08 +08:00
  • 4377c4dc57 Merge pull request #63 from fzyzcjy/patch-2 Liang 2025-03-14 10:27:48 +08:00
  • 3f92607b98 Fix get_col_major_tma_aligned_tensor to handle 2-dimensional inputs z-navy 2025-03-13 22:15:16 +08:00
  • e7fff7ef0a Update m_grouped_gemm.py fzyzcjy 2025-03-13 22:09:15 +08:00
  • bd2a775528 Code format Chenggang Zhao 2025-03-11 13:26:10 +08:00
  • 5233bad1e9 Merge pull request #55 from sleepcoo/fix-cudagraph Chenggang Zhao 2025-03-11 13:25:35 +08:00
  • 723a00338e fix cuda_graph rng check error sleepcoo 2025-03-11 12:40:42 +08:00
  • 5e4badc577 Fix type lint Chenggang Zhao 2025-03-10 13:10:16 +08:00
  • ba1e93a5c7 Merge pull request #44 from sazczmh/main Chenggang Zhao 2025-03-10 13:08:03 +08:00
  • bed67b234c Minor fix sazc 2025-03-10 13:02:02 +08:00
  • ed278eddd3 formats: Optimize get_best_configs implementation sazc 2025-03-10 12:56:14 +08:00
  • 50cf26cc7c Performance: Configuration algorithms tuned to minimize the impact of tail effects, now up to 1402 TFLOPS sazc 2025-03-10 11:45:05 +08:00
  • 39c10e6c31 Revert "Merge pull request #49 from A-transformer/maximum_fp8_e4m3_value" Chenggang Zhao 2025-03-10 09:47:02 +08:00
  • 4d4f2342fe Merge pull request #49 from A-transformer/maximum_fp8_e4m3_value Liang 2025-03-08 09:50:06 +08:00
  • 629857685e Maximum representable value in FP8 E4M3 format A-transformer 2025-03-07 19:58:02 +04:00
  • 9d3222a93e Merge pull request #42 from sazczmh/main Liang 2025-03-05 20:09:56 +08:00
  • fcd1dcd99d Performance: reducing the percentage of FFMA interleaving yields a slight performance gain, roughly 0.5% sazc 2025-03-05 17:50:22 +08:00
  • 68fc742572 Merge pull request #36 from yz-tang/fix_setup_build Chenggang Zhao 2025-03-04 17:09:50 +08:00
  • 6c59e0f40d Fix setup build error when the setuptools version is too low yz-tang 2025-03-04 16:53:00 +08:00
  • 9b0dad8640 Add some notes for promotion Chenggang Zhao 2025-03-04 11:42:20 +08:00
  • ded740f736 Fix documentation of m_grouped_gemm_fp8_fp8_bf16_nt_contiguous in m_grouped_gemm.py Liang 2025-03-04 11:26:23 +08:00
  • dff6bb6f0b Add some notes Chenggang Zhao 2025-03-03 11:35:52 +08:00
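
Several of the utility-oriented commits above (e.g. 350989eef3 "Unify ceil_divs" and 776bd0cccc "add lru-cache to avoid repeated calculation") revolve around small host-side helpers for tile arithmetic and cached config selection. The snippet below is a minimal, hypothetical sketch of that pattern in Python; select_block_shape and its heuristic are invented for illustration and are not the repository's actual get_best_configs logic.

    from functools import lru_cache


    def ceil_div(x: int, y: int) -> int:
        """Round up integer division, e.g. ceil_div(7, 4) == 2."""
        return (x + y - 1) // y


    @lru_cache(maxsize=None)
    def select_block_shape(m: int, n: int, k: int) -> tuple[int, int]:
        """Hypothetical cached config lookup: pick a block tile for a GEMM shape.

        Caching avoids recomputing the heuristic for shapes seen repeatedly,
        which is the motivation behind the lru-cache commit above.
        """
        # Toy heuristic: prefer a larger 256x128 tile when M is big enough,
        # otherwise fall back to 128x128.
        block_m = 256 if m >= 256 else 128
        block_n = 128
        return block_m, block_n


    if __name__ == "__main__":
        print(ceil_div(4097, 128))                 # tiles needed to cover 4097 elements
        print(select_block_shape(4096, 7168, 2048))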