CUTLASS: Fast Linear Algebra in CUDA C++

The state-of-the-art architecture at the time of that post was Volta, which had Tensor Cores, each capable of doing 64 fused multiply-add (FMA) operations per clock. That’s why the thread tile was organized in a 8 x 8 grid. Ampere did 256 FMA operations per clock. Here’s a good post explaining more.