DeepGEMM

mirror of https://github.com/deepseek-ai/DeepGEMM synced 2025-05-06 22:25:02 +00:00

Author	SHA1	Message	Date
yukuai26	891f35adf5	Support TMA multicast on B with m_grouped_gemm_contiguous. (#88)	2025-04-21 09:43:17 +08:00
ademeure	6cbff5778f	Correctly flush L2, as reconstructing the tensors on every iteration effectively put them in the L2, and gave the GPU enough idle time to avoid thermal throttling in a potentially unrealistic way. The previous behaviour is potentially representative of some use cases (e.g. previous kernel filling L2 with the data in a very specific way) but not standard benchmarking practice.	2025-03-15 20:46:24 +00:00
Chenggang Zhao	39c10e6c31	Revert "Merge pull request #49 from A-transformer/maximum_fp8_e4m3_value" This reverts commit `4d4f2342fe`, reversing changes made to `9d3222a93e`.	2025-03-10 09:47:02 +08:00
A-transformer	629857685e	Maximum representable value in FP8 E4M3 format: replace hardcoded 448.0 with global constant FP8_E4M3_MAX	2025-03-07 19:58:02 +04:00
AcraeaTerpsicore	96b31fd6bb	fix typo	2025-02-26 18:37:22 +08:00
xuzhean	bc989405fe	fix: prevent expected_m from exceeding m in test_core	2025-02-26 16:55:47 +08:00
Chenggang Zhao	a6d97a1c1b	Initial commit	2025-02-25 22:52:41 +08:00
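
The commit by A-transformer above refers to 448.0 as the maximum representable value in FP8 E4M3. The following is a minimal sketch of where that number comes from, not code from DeepGEMM. It assumes the finite-only E4M3 layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, NaN only when exponent and mantissa bits are all ones). The helper `fp8_e4m3_decode` is hypothetical; only the name `FP8_E4M3_MAX` is taken from the commit message.

```python
# Illustrative only: decode every 8-bit pattern under the assumed
# finite-only E4M3 layout and take the largest finite value.

EXPONENT_BIAS = 7   # assumed bias for the 4-bit exponent
MANTISSA_BITS = 3   # assumed width of the mantissa field


def fp8_e4m3_decode(bits: int) -> float:
    """Decode one 8-bit E4M3 pattern (hypothetical helper) to a Python float."""
    sign = -1.0 if (bits >> 7) & 0x1 else 1.0
    exponent = (bits >> MANTISSA_BITS) & 0xF
    mantissa = bits & 0x7
    fraction = mantissa / float(1 << MANTISSA_BITS)

    # In this layout the only NaN encoding is exponent=1111, mantissa=111.
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")
    # Subnormal numbers: no implicit leading one, fixed exponent of 1 - bias.
    if exponent == 0:
        return sign * fraction * 2.0 ** (1 - EXPONENT_BIAS)
    # Normal numbers: implicit leading one.
    return sign * (1.0 + fraction) * 2.0 ** (exponent - EXPONENT_BIAS)


# Largest finite value over all 256 bit patterns (v == v filters out NaN).
FP8_E4M3_MAX = max(v for v in map(fp8_e4m3_decode, range(256)) if v == v)

if __name__ == "__main__":
    print(FP8_E4M3_MAX)
```

Under these assumptions the largest finite pattern is `0x7E` (exponent 1111, mantissa 110), which decodes to 2^8 × 1.75. A named constant derived or documented this way makes the scaling bound easier to audit than a bare literal.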