Implementing High Performance Matrix Multiplication Using CUTLASS v2.8

jwitsoe November 23, 2021, 2:35pm

Originally published at: Implementing High Performance Matrix Multiplication Using CUTLASS v2.8 | NVIDIA Technical Blog

High-performance CUTLASS template abstractions support matrix multiply operations (GEMM), Convolution AI, and improved Strided-DGrad.
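
For readers new to the term, here is a minimal sketch of what a GEMM computes, written as a plain CPU reference in C++. It is not CUTLASS code and does not use any CUTLASS API. The function name `reference_gemm`, the row-major layout, and the `float` element type are assumptions chosen for illustration. It uses the usual GEMM form D = alpha * A * B + beta * C.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference GEMM: D = alpha * A * B + beta * C.
// A is M x K, B is K x N, C and D are M x N, all stored row-major.
// NOTE: reference_gemm is a hypothetical helper for illustration only;
// it is not part of CUTLASS or cuBLAS.
std::vector<float> reference_gemm(std::size_t M, std::size_t N, std::size_t K,
                                  float alpha,
                                  const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  float beta,
                                  const std::vector<float>& C) {
    // Check that each buffer matches its declared shape.
    assert(A.size() == M * K);
    assert(B.size() == K * N);
    assert(C.size() == M * N);

    std::vector<float> D(M * N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            // Dot product of row i of A with column j of B.
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            // Scale the product and blend in the source matrix C.
            D[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
    return D;
}
```

A triple loop like this is typically only useful as a correctness check for small problem sizes. Optimized GPU libraries compute the same result with tiling and hardware-specific instructions.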
