The state-of-the-art architecture at the time of that post was Volta, whose Tensor Cores were each capable of 64 fused multiply-add (FMA) operations per clock. That’s why the thread tile was organized in an 8 x 8 grid: 8 x 8 = 64. Ampere raised that to 256 FMA operations per clock. Here’s a good post explaining more.
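To make the arithmetic concrete, here is a minimal host-side C++ sketch (not CUTLASS code) of one k-step of a thread-tile update. It assumes the usual outer-product formulation, where each thread holds an 8 x 8 block of accumulators and adds `a[i] * b[j]` to each of them. The names `kTile` and `rank1_update` are made up for illustration.

```cpp
#include <array>
#include <cmath>

// Hypothetical name: side length of the 8 x 8 thread tile from the post.
constexpr int kTile = 8;

// 64 accumulators per thread tile, matching 64 FMAs per clock on Volta.
static_assert(kTile * kTile == 64, "8 x 8 tile holds 64 accumulators");

using Fragment = std::array<float, kTile>;
using Accumulators = std::array<std::array<float, kTile>, kTile>;

// One k-step of the tile update: C += a * b^T (a rank-1 outer product).
// Returns the number of FMAs performed so the count can be checked.
int rank1_update(const Fragment& a, const Fragment& b, Accumulators& c) {
    int fmas = 0;
    for (int i = 0; i < kTile; ++i) {
        for (int j = 0; j < kTile; ++j) {
            c[i][j] = std::fma(a[i], b[j], c[i][j]);  // one fused multiply-add
            ++fmas;
        }
    }
    return fmas;
}
```

Calling `rank1_update` once with any two fragments returns 64, one FMA per accumulator in the tile.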
igor_furoa