Mirror of https://github.com/deepseek-ai/DeepGEMM, synced 2025-05-05 04:54:22 +00:00
Commit a9967bc27c ("Update README"), parent 5a80e4bb96
README.md: 14 lines changed
@@ -6,6 +6,10 @@ Currently, DeepGEMM exclusively supports NVIDIA Hopper tensor cores. To address
 
 Despite its lightweight design, DeepGEMM's performance matches or exceeds expert-tuned libraries across various matrix shapes.
 
+## News
+
+- 2025.04.09: DeepGEMM now achieves up to **1520 TFLOPS** on H800! See #74, #78, and #81 for details.
+
 ## Performance
 
 We test all shapes potentially used in DeepSeek-V3/R1 inference (including both prefilling and decoding, but without tensor parallelism) on H800 SXM5 with NVCC 12.8. All speedup metrics are calculated in comparison to our internally and carefully optimized implementation based on CUTLASS 3.6.
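The paragraph above defines the two derived metrics used in the tables that follow. As a minimal sketch (not the repository's benchmark harness; the function and variable names here are illustrative, and the CUTLASS timing stands for the authors' internal baseline), this is how the TFLOPS and speedup columns follow from measured kernel times:

```python
# Illustrative only: DeepGEMM ships its own benchmark scripts; this sketch
# just makes the table's arithmetic explicit. Times are in seconds.

def gemm_metrics(m: int, n: int, k: int, t_deepgemm: float, t_cutlass: float):
    """Derive the table's metrics for one M x N x K GEMM shape."""
    flops = 2 * m * n * k                 # each output element costs k multiply-adds
    tflops = flops / t_deepgemm / 1e12    # achieved compute throughput
    speedup = t_cutlass / t_deepgemm      # vs. the internal CUTLASS 3.6 baseline
    return tflops, speedup

# Example: the 4096 x 7168 x 16384 entry at 1520 TFLOPS implies a kernel
# time of about 2 * 4096 * 7168 * 16384 / 1.52e15, i.e. roughly 0.63 ms.
```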
@@ -28,11 +32,11 @@ DeepGEMM does not behave very well on some shapes, optimization PRs are welcomed
 | 128  | 7168  | 16384 | 645 TFLOPS  | 2604 GB/s | 1.4x |
 | 128  | 4096  | 7168  | 533 TFLOPS  | 2221 GB/s | 2.0x |
 | 128  | 7168  | 2048  | 510 TFLOPS  | 2277 GB/s | 1.7x |
-| 4096 | 2112  | 7168  | 1009 TFLOPS | 503 GB/s  | 1.1x |
-| 4096 | 24576 | 1536  | 1125 TFLOPS | 893 GB/s  | 1.1x |
-| 4096 | 32768 | 512   | 751 TFLOPS  | 1569 GB/s | 1.1x |
-| 4096 | 7168  | 16384 | 1426 TFLOPS | 361 GB/s  | 1.3x |
-| 4096 | 4096  | 7168  | 1265 TFLOPS | 485 GB/s  | 1.2x |
+| 4096 | 2112  | 7168  | 1127 TFLOPS | 562 GB/s  | 1.2x |
+| 4096 | 24576 | 1536  | 1212 TFLOPS | 962 GB/s  | 1.2x |
+| 4096 | 32768 | 512   | 775 TFLOPS  | 1620 GB/s | 1.2x |
+| 4096 | 7168  | 16384 | 1520 TFLOPS | 384 GB/s  | 1.4x |
+| 4096 | 4096  | 7168  | 1410 TFLOPS | 541 GB/s  | 1.3x |
 | 4096 | 7168  | 2048  | 1168 TFLOPS | 794 GB/s  | 1.2x |
 
 ### Grouped GEMMs for MoE models (contiguous layout)
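The memory-bandwidth column in the dense-GEMM table above is consistent with a simple byte-accounting model. This is an inference on my part, not stated in the diff: FP8 operands cost one byte per element, the BF16 output two, and the small per-block scaling factors are ignored.

```python
# Assumed accounting (not taken from the repo): bytes moved per GEMM is
# A (M x K, FP8) + B (K x N, FP8) + D (M x N, BF16), scaling factors ignored.

def gemm_gbps(m: int, n: int, k: int, t_seconds: float) -> float:
    bytes_moved = m * k + k * n + 2 * m * n
    return bytes_moved / t_seconds / 1e9

# Cross-check against the updated 4096 x 32768 x 512 row: 775 TFLOPS implies
# t = 2*m*n*k / 7.75e14, about 0.177 ms, and bytes_moved is about 287e6,
# giving roughly 1620 GB/s, which matches the table entry.
```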
|
Loading…
Reference in New Issue
Block a user