Implementing High Performance Matrix Multiplication Using CUTLASS v2.8

Originally published at: Implementing High Performance Matrix Multiplication Using CUTLASS v2.8 | NVIDIA Technical Blog

High-performance CUTLASS template abstractions support matrix multiply operations (GEMM), convolution for AI, and an improved Strided-DGrad.