Commit Graph

  • 594020e6af Code polishing x2 Chenggang Zhao 2025-04-18 16:43:14 +0800
  • a51629ddf9 Code polishing Chenggang Zhao 2025-04-18 16:43:00 +0800
  • 2752a67aad Support TMA multicast on B with m_grouped_gemm_contiguous. yukuai 2025-04-18 15:38:04 +0800
  • 83aa960b9b Fix bugs Chenggang Zhao 2025-04-18 11:55:51 +0800
  • fea9309c1e Update README Chenggang Zhao 2025-04-18 11:38:52 +0800
  • 340d9880f4 Overlap TMA store Chenggang Zhao 2025-04-18 11:18:23 +0800
  • 4499c4ccbb Refactor MMA template with CUTLASS (#87) Zhean Xu 2025-04-14 17:06:49 +0800
  • 857d57d157 Update README.md Zhean Xu 2025-04-14 17:03:35 +0800
  • 584b67eebb Refactor MMA with cutlass Zhean Xu 2025-04-14 16:56:13 +0800
  • 37aa127451 Use swizzling instead of padding (#86) Chenggang Zhao 2025-04-14 15:20:58 +0800
  • 406c630709 Stricter assertions Chenggang Zhao 2025-04-14 11:55:37 +0800
  • b699750a4a Fix README Chenggang Zhao 2025-04-14 11:33:05 +0800
  • f8797d3c12 Optimize TMA issues Chenggang Zhao 2025-04-14 11:29:28 +0800
  • 6366f5ad1a Optimize expression Chenggang Zhao 2025-04-14 10:26:33 +0800
  • 9406e2a3a1 Optimize swizzle performance Chenggang Zhao 2025-04-14 10:10:46 +0800
  • 5ff0eb24b5 Fix bugs Chenggang Zhao 2025-04-11 18:27:56 +0800
  • 23d4289365 Compatible with padding Chenggang Zhao 2025-04-11 17:38:22 +0800
  • 4c111418a2 Swizzling draft Chenggang Zhao 2025-04-11 17:19:59 +0800
  • 76804c096d Always use STSMx2 Chenggang Zhao 2025-04-11 14:37:37 +0800
  • 93c92c2c89 Add TMA D descriptor Chenggang Zhao 2025-04-11 14:10:11 +0800
  • 6078b25424 Add swizzling params Chenggang Zhao 2025-04-11 13:15:52 +0800
  • 2e7e58011b Merge pull request #83 from deepseek-ai/tma-1d-store Chenggang Zhao 2025-04-11 11:25:01 +0800
  • b0d64817a7 OOB bugs fixed Chenggang Zhao 2025-04-11 11:00:47 +0800
  • 99eb6ec563 Remove useless STSM Chenggang Zhao 2025-04-11 10:45:36 +0800
  • 8041ed7164 Use 1D TMA store Chenggang Zhao 2025-04-11 10:42:01 +0800
  • a77009cb14 Make partition pipelined Chenggang Zhao 2025-04-10 18:07:25 +0800
  • 5bda27244b Add CMake support for CLion indexing Chenggang Zhao 2025-04-10 09:52:15 +0800
  • 327ec92f69 Update roadmap Chenggang Zhao 2025-04-09 11:44:30 +0800
  • 677143be64 Update roadmap Chenggang Zhao 2025-04-09 11:41:36 +0800
  • fed3e4d701 Merge pull request #81 from deepseek-ai/blocktile-256x128 Chenggang Zhao 2025-04-09 11:26:40 +0800
  • 989c9e3694 Update README Chenggang Zhao 2025-04-09 11:17:47 +0800
  • a9967bc27c Update README Chenggang Zhao 2025-04-09 11:14:45 +0800
  • 5a80e4bb96 Fix indent x2 Chenggang Zhao 2025-04-09 11:00:10 +0800
  • bdca8b0624 Fix indent Chenggang Zhao 2025-04-09 10:59:07 +0800
  • 4c0cc290c7 Refactor M repetition with loops Chenggang Zhao 2025-04-09 10:50:44 +0800
  • a6524d411a Larger block N candidates Chenggang Zhao 2025-04-09 10:11:43 +0800
  • 48a5f071be Clean up config heuristics Chenggang Zhao 2025-04-09 10:01:15 +0800
  • ce65d5e33c Remove unused x256 WGMMA Chenggang Zhao 2025-04-09 09:32:46 +0800
  • 97575bf1c6 Performance: BlockTile 256x128 optimizations enable 1500+ TFLOPS FP8 performance on the H800-SXM platform sazc 2025-04-08 17:42:23 +0800
  • b4ecf9c3ff Fix TMA multicast bugs Chenggang Zhao 2025-04-07 14:34:42 +0800
  • bff5724ded Code format Chenggang Zhao 2025-04-07 09:32:43 +0800
  • 3ea3cb203c Merge pull request #80 from abcdabcd987/fix-link-error Chenggang Zhao 2025-04-07 09:31:58 +0800
  • b0868c9014 Merge pull request #79 from yizhang2077/lru-cache-opt Chenggang Zhao 2025-04-07 09:31:30 +0800
  • 611e3f659d Fix linking error from ODR violation Lequn Chen 2025-04-05 17:35:04 +0000
  • 776bd0cccc add lru-cache to avoid repeated calculation Yi Zhang 2025-04-04 12:44:26 +0800
  • c187c23ba8 Merge pull request #78 from deepseek-ai/tma-3d-padding Chenggang Zhao 2025-04-03 16:06:10 +0800
  • d14962f072 Add DG_NVCC_OVERRIDE_CPP_STANDARD Chenggang Zhao 2025-04-03 15:53:29 +0800
  • 3a5539b7db Use c++20 Chenggang Zhao 2025-04-03 15:47:59 +0800
  • 6db7e1863b Solve STSM bank conflict via padding and 3D TMA Chenggang Zhao 2025-04-03 15:39:35 +0800
  • d7ce715118 Revert "Update nvcc flag c++20" Liang 2025-03-28 10:43:29 +0800
  • c57699ac93 Merge pull request #76 from YLGH/patch-1 Liang 2025-03-26 09:52:47 +0800
  • b7db15ce94 Update nvcc flag c++20 YLGH 2025-03-25 14:15:39 -0700
  • 8002b769c0 Update README Chenggang Zhao 2025-03-25 18:13:24 +0800
  • a5645d7afa Merge pull request #74 from deepseek-ai/larger-block Chenggang Zhao 2025-03-25 18:07:33 +0800
  • 55ab91f72f Update performance Chenggang Zhao 2025-03-25 18:06:47 +0800
  • 09d097f84d Add some notes Chenggang Zhao 2025-03-25 17:41:49 +0800
  • 25db8de345 Better performance Chenggang Zhao 2025-03-25 17:34:06 +0800
  • 1999d553e5 Lower TMA requirement Chenggang Zhao 2025-03-25 17:18:53 +0800
  • ddccb230ca Fix NVCC branch divergence Chenggang Zhao 2025-03-25 17:12:51 +0800
  • 9c4f6f53f5 Optimize compilation speed Chenggang Zhao 2025-03-25 16:51:21 +0800
  • 612dd57001 Simplify code Chenggang Zhao 2025-03-25 16:45:20 +0800
  • 046fab64b7 Fix grouped GEMM cases Chenggang Zhao 2025-03-25 16:41:44 +0800
  • 7768319ffe Remove unaligned predicates Chenggang Zhao 2025-03-25 16:32:40 +0800
  • 3497428a5e Minor fix Chenggang Zhao 2025-03-25 15:16:26 +0800
  • 7ffb118e54 Support multicasting on B Chenggang Zhao 2025-03-25 14:56:42 +0800
  • 742fb1c8a5 Compilation-time GCD Chenggang Zhao 2025-03-25 13:41:28 +0800
  • b922e64cb2 Support block size 160 Chenggang Zhao 2025-03-25 13:37:59 +0800
  • 46eb0d08fb Performance: Larger BlockTile optimizations enable 1470+ TFLOPS FP8 performance on the H800-SXM platform sazc 2025-03-25 10:44:57 +0800
  • 3b3783d06c Merge pull request #68 from ademeure/flush_l2_pr Liang 2025-03-16 09:16:34 +0800
  • 6cbff5778f Correctly flush L2, as reconstructing the tensors on every iteration effectively put them in the L2, and gave the GPU enough idle time to avoid thermal throttling in a potentially unrealistic way. ademeure 2025-03-15 20:46:24 +0000
  • e1c070fbef Merge pull request #65 from Z-NAVY/main Liang 2025-03-14 13:50:08 +0800
  • 2f6af9d736 Adding Dockerfile to gitignore Derek Rosenzweig 2025-03-13 20:04:48 -0700
  • eb8e8346c8 Add CSV benchmark results saving feature Derek Rosenzweig 2025-03-13 20:02:24 -0700
  • 4377c4dc57 Merge pull request #63 from fzyzcjy/patch-2 Liang 2025-03-14 10:27:48 +0800
  • 3f92607b98 Fix get_col_major_tma_aligned_tensor to handle 2-dimensional inputs z-navy 2025-03-13 22:15:16 +0800
  • e7fff7ef0a Update m_grouped_gemm.py fzyzcjy 2025-03-13 22:09:15 +0800
  • cf640558af Update fp8_gemm.cuh fzyzcjy 2025-03-13 21:02:52 +0800
  • bcd5fe8271 Merge pull request #2 from yukavio/add_backward_w yukavio 2025-03-13 15:06:09 +0800
  • 094d0421ec refine kavioyu 2025-03-13 07:04:56 +0000
  • e22d67d451 Merge pull request #1 from yukavio/add_backward_w yukavio 2025-03-13 10:14:50 +0800
  • 6e53c6613d tested kavioyu 2025-03-12 13:48:02 +0000
  • bd2a775528 Code format Chenggang Zhao 2025-03-11 13:26:10 +0800
  • 5233bad1e9 Merge pull request #55 from sleepcoo/fix-cudagraph Chenggang Zhao 2025-03-11 13:25:35 +0800
  • 723a00338e fix cuda_graph rng check error sleepcoo 2025-03-11 12:40:42 +0800
  • 5b9bfa6057 fix 2D path YLGH 2025-03-10 17:09:05 +0000
  • 5e4badc577 Fix type lint Chenggang Zhao 2025-03-10 13:10:16 +0800
  • ba1e93a5c7 Merge pull request #44 from sazczmh/main Chenggang Zhao 2025-03-10 13:08:03 +0800
  • bed67b234c Minor fix sazc 2025-03-10 13:02:02 +0800
  • ed278eddd3 formats: Optimize get_best_configs implementation sazc 2025-03-10 12:56:14 +0800
  • 50cf26cc7c Performance: Configuration algorithms tuned to minimize the impact of tail effects, now up to 1402 TFLOPS sazc 2025-03-10 11:45:05 +0800
  • 39c10e6c31 Revert "Merge pull request #49 from A-transformer/maximum_fp8_e4m3_value" Chenggang Zhao 2025-03-10 09:47:02 +0800
  • 4d4f2342fe Merge pull request #49 from A-transformer/maximum_fp8_e4m3_value Liang 2025-03-08 09:50:06 +0800
  • 629857685e Maximum representable value in FP8 E4M3 format A-transformer 2025-03-07 19:58:02 +0400
  • 82d4b996f7 Add TMA optimization hint for large FFMA segments A-transformer 2025-03-07 19:37:55 +0400
  • de1603d5e4 fix typo dxh 2025-03-07 11:42:51 +0800
  • 0b5d353dba tensor alignment fix dxh 2025-03-07 10:58:37 +0800
  • ba142b891a config cache and tensor alignment fix dxh 2025-03-07 10:54:14 +0800
  • dcf8ec2041 add assert exists(src_dir) A-transformer 2025-03-06 20:34:59 +0400
  • 9d3222a93e Merge pull request #42 from sazczmh/main Liang 2025-03-05 20:09:56 +0800
  • fcd1dcd99d Performance: reducing the percentage of FFMA interleaving yields a slight performance gain, roughly 0.5% sazc 2025-03-05 17:50:22 +0800