Commit Graph

  • d7ce715118 Revert "Update nvcc flag c++20" revert-76-patch-1 Liang 2025-03-28 10:43:29 +0800
  • c57699ac93 Merge pull request #76 from YLGH/patch-1 main Liang 2025-03-26 09:52:47 +0800
  • b7db15ce94 Update nvcc flag c++20 YLGH 2025-03-25 14:15:39 -0700
  • 8002b769c0 Update README Chenggang Zhao 2025-03-25 18:13:24 +0800
  • a5645d7afa Merge pull request #74 from deepseek-ai/larger-block Chenggang Zhao 2025-03-25 18:07:33 +0800
  • 55ab91f72f Update performance larger-block Chenggang Zhao 2025-03-25 18:06:47 +0800
  • 09d097f84d Add some notes Chenggang Zhao 2025-03-25 17:41:49 +0800
  • 25db8de345 Better performance Chenggang Zhao 2025-03-25 17:34:06 +0800
  • 1999d553e5 Lower TMA requirement Chenggang Zhao 2025-03-25 17:18:53 +0800
  • ddccb230ca Fix NVCC branch divergence Chenggang Zhao 2025-03-25 17:12:51 +0800
  • 9c4f6f53f5 Optimize compilation speed Chenggang Zhao 2025-03-25 16:51:21 +0800
  • 612dd57001 Simplify code Chenggang Zhao 2025-03-25 16:45:20 +0800
  • 046fab64b7 Fix grouped GEMM cases Chenggang Zhao 2025-03-25 16:41:44 +0800
  • 7768319ffe Remove unaligned predicates Chenggang Zhao 2025-03-25 16:32:40 +0800
  • 3497428a5e Minor fix Chenggang Zhao 2025-03-25 15:16:26 +0800
  • 7ffb118e54 Support multicasting on B Chenggang Zhao 2025-03-25 14:56:42 +0800
  • 742fb1c8a5 Compilation-time GCD Chenggang Zhao 2025-03-25 13:41:28 +0800
  • b922e64cb2 Support block size 160 Chenggang Zhao 2025-03-25 13:37:59 +0800
  • 46eb0d08fb Performance: Larger BlockTile optimizations enable 1470+ TFLOPS FP8 performance on the H800-SXM platform sazc 2025-03-25 10:44:57 +0800
  • 3b3783d06c Merge pull request #68 from ademeure/flush_l2_pr Liang 2025-03-16 09:16:34 +0800
  • 6cbff5778f Correctly flush L2, as reconstructing the tensors on every iteration effectively put them in the L2, and gave the GPU enough idle time to avoid thermal throttling in a potentially unrealistic way. ademeure 2025-03-15 20:46:24 +0000
  • e1c070fbef Merge pull request #65 from Z-NAVY/main Liang 2025-03-14 13:50:08 +0800
  • 2f6af9d736 Adding Dockerfile to gitignore Derek Rosenzweig 2025-03-13 20:04:48 -0700
  • eb8e8346c8 Add CSV benchmark results saving feature Derek Rosenzweig 2025-03-13 20:02:24 -0700
  • 4377c4dc57 Merge pull request #63 from fzyzcjy/patch-2 Liang 2025-03-14 10:27:48 +0800
  • 3f92607b98 Fix get_col_major_tma_aligned_tensor to handle 2-dimensional inputs z-navy 2025-03-13 22:15:16 +0800
  • e7fff7ef0a Update m_grouped_gemm.py fzyzcjy 2025-03-13 22:09:15 +0800
  • cf640558af Update fp8_gemm.cuh fzyzcjy 2025-03-13 21:02:52 +0800
  • bcd5fe8271 Merge pull request #2 from yukavio/add_backward_w yukavio 2025-03-13 15:06:09 +0800
  • 094d0421ec refine kavioyu 2025-03-13 07:04:56 +0000
  • e22d67d451 Merge pull request #1 from yukavio/add_backward_w yukavio 2025-03-13 10:14:50 +0800
  • 6e53c6613d tested kavioyu 2025-03-12 13:48:02 +0000
  • bd2a775528 Code format Chenggang Zhao 2025-03-11 13:26:10 +0800
  • 5233bad1e9 Merge pull request #55 from sleepcoo/fix-cudagraph Chenggang Zhao 2025-03-11 13:25:35 +0800
  • 723a00338e fix cuda_graph rng check error sleepcoo 2025-03-11 12:40:42 +0800
  • 5b9bfa6057 fix 2D path YLGH 2025-03-10 17:09:05 +0000
  • 5e4badc577 Fix type lint Chenggang Zhao 2025-03-10 13:10:16 +0800
  • ba1e93a5c7 Merge pull request #44 from sazczmh/main Chenggang Zhao 2025-03-10 13:08:03 +0800
  • bed67b234c Minor fix sazc 2025-03-10 13:02:02 +0800
  • ed278eddd3 formats: Optimize get_best_configs implementation sazc 2025-03-10 12:56:14 +0800
  • 50cf26cc7c Performance: Configuration algorithms tuned to minimize the impact of tail effects, now up to 1402 TFLOPS sazc 2025-03-10 11:45:05 +0800
  • 39c10e6c31 Revert "Merge pull request #49 from A-transformer/maximum_fp8_e4m3_value" Chenggang Zhao 2025-03-10 09:47:02 +0800
  • 4d4f2342fe Merge pull request #49 from A-transformer/maximum_fp8_e4m3_value Liang 2025-03-08 09:50:06 +0800
  • 629857685e Maximum representable value in FP8 E4M3 format A-transformer 2025-03-07 19:58:02 +0400
  • 82d4b996f7 Add TMA optimization hint for large FFMA segments A-transformer 2025-03-07 19:37:55 +0400
  • de1603d5e4 fix typo dxh 2025-03-07 11:42:51 +0800
  • 0b5d353dba tensor alignment fix dxh 2025-03-07 10:58:37 +0800
  • ba142b891a config cache and tensor alignment fix dxh 2025-03-07 10:54:14 +0800
  • dcf8ec2041 add assert exists(src_dir) A-transformer 2025-03-06 20:34:59 +0400
  • 9d3222a93e Merge pull request #42 from sazczmh/main Liang 2025-03-05 20:09:56 +0800
  • fcd1dcd99d Performance: reducing the percentage of FFMA interleaving yields a slight performance gain, roughly 0.5% sazc 2025-03-05 17:50:22 +0800
  • 68fc742572 Merge pull request #36 from yz-tang/fix_setup_build Chenggang Zhao 2025-03-04 17:09:50 +0800
  • 6c59e0f40d fix setup build error when setuptools version is lower yz-tang 2025-03-04 16:53:00 +0800
  • 9b0dad8640 Add some notes for promotion Chenggang Zhao 2025-03-04 11:42:20 +0800
  • ded740f736 Fix documentation of m_grouped_gemm_fp8_fp8_bf16_nt_contiguous in m_grouped_gemm.py Liang 2025-03-04 11:26:23 +0800
  • dff6bb6f0b Add some notes Chenggang Zhao 2025-03-03 11:35:52 +0800
  • 6d0fd7de41 support groupwise scaling for b zhengxuegui.0 2025-03-01 23:04:15 +0800
  • 261fff9c48 Specify uint8_t as enum size Colin Peppler 2025-02-27 21:16:35 -0800
  • 6c5da03ba9 Support more shapes Chenggang Zhao 2025-02-28 10:04:59 +0800
  • b69f630b91 Minor fix util function Chenggang Zhao 2025-02-28 09:46:38 +0800
  • 6e10cba207 Minor fix Chenggang Zhao 2025-02-28 09:21:35 +0800
  • b0b9e03345 refactor the loop if/else check A-transformer 2025-02-27 22:23:53 +0400
  • 92521df34d comment more clear about Memory Consistency and Barrier Visibility A-transformer 2025-02-27 22:01:50 +0400
  • a2e0d68eed Merge pull request #2 from deepseek-ai/main A-transformer 2025-02-27 21:47:31 +0400
  • fbec9e5eee Update get_best_configs Liang 2025-02-27 23:18:52 +0800
  • 461427ecd0 Merge pull request #27 from vatlor/main Zhean Xu 2025-02-27 20:37:31 +0800
  • 488b5fc467 fix typo dotrail 2025-02-27 11:53:33 +0000
  • b4d5f535bb pytest Integration A-transformer 2025-02-27 14:38:54 +0400
  • 6da94d2d36 Add extra TMA checks Chenggang Zhao 2025-02-27 18:20:57 +0800
  • ca13ce0fab Fix TMA store bugs and code format Chenggang Zhao 2025-02-27 17:57:21 +0800
  • 8933678ee5 upd BBuf 2025-02-27 17:21:18 +0800
  • 22c163be25 pytest Integration A-transformer 2025-02-27 11:55:32 +0400
  • 60cce9a6e3 pytest Integration A-transformer 2025-02-27 11:54:28 +0400
  • a813073fac pytest Integration A-transformer 2025-02-27 11:53:38 +0400
  • 5479ffebb0 pytest Integration A-transformer 2025-02-27 11:44:52 +0400
  • f9a6da9ac2 pytest Integration A-transformer 2025-02-27 11:43:23 +0400
  • 58046b4e01 pytest Integration A-transformer 2025-02-27 09:48:20 +0400
  • b05ed2f017 Code format Chenggang Zhao 2025-02-27 10:50:20 +0800
  • 676329b8e2 Merge pull request #19 from dzhulgakov/fix-wheel Chenggang Zhao 2025-02-27 10:44:11 +0800
  • 6e55da296f Fix python -O mode issues Chenggang Zhao 2025-02-27 10:42:46 +0800
  • d5b974da2b Merge pull request #16 from AcraeaTerpsicore/patch-1 Chenggang Zhao 2025-02-27 10:34:12 +0800
  • fc7c3f8299 setup.py: fix wheel building Dmytro Dzhulgakov 2025-02-26 17:48:57 +0000
  • 78cacf70d4 Update README.md Zhean Xu 2025-02-26 19:20:39 +0800
  • 96b31fd6bb fix typo AcraeaTerpsicore 2025-02-26 18:37:22 +0800
  • bc989405fe fix: prevent expected_m from exceeding m in test_core xuzhean 2025-02-26 16:55:47 +0800
  • eec7ab7f03 Merge pull request #13 from ZeppLu/patch-1 Zhean Xu 2025-02-26 16:34:23 +0800
  • 7a70b439cd doc: Use permanent link Zepp 2025-02-26 16:15:37 +0800
  • 184ce9b5ea Merge pull request #3 from acheong08/patch-1 Chenggang Zhao 2025-02-26 13:29:45 +0800
  • 5da24e229a spelling: README.md Antonio Cheong 2025-02-26 02:36:04 +0000
  • a6d97a1c1b Initial commit Chenggang Zhao 2025-02-25 22:52:41 +0800