Just Released: CUTLASS 3.8

jwitsoe, February 3, 2025, 11:54pm

Originally published at: GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines

CUTLASS 3.8 adds support for the NVIDIA Blackwell SM100 architecture. CUTLASS is a collection of CUDA C++ templates and abstractions for implementing high-performance general matrix multiplication (GEMM).
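For readers new to GEMM, the operation CUTLASS kernels are usually described as computing is D = alpha * (A * B) + beta * C, where A is M x K, B is K x N, and C and D are M x N. The sketch below is a naive host-side loop, not CUTLASS code. It only shows the math a tuned kernel must reproduce and can serve as a slow correctness reference. Its assumptions are row-major storage, `float` elements, and the function name `reference_gemm`, which is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical naive reference GEMM (not a CUTLASS API):
//   D = alpha * (A * B) + beta * C
// Assumed layout: row-major, A is M x K, B is K x N, C and D are M x N.
std::vector<float> reference_gemm(int M, int N, int K, float alpha,
                                  const std::vector<float>& A,
                                  const std::vector<float>& B, float beta,
                                  const std::vector<float>& C) {
  // Check that the buffer sizes match the stated dimensions.
  assert(A.size() == static_cast<std::size_t>(M) * K);
  assert(B.size() == static_cast<std::size_t>(K) * N);
  assert(C.size() == static_cast<std::size_t>(M) * N);

  std::vector<float> D(static_cast<std::size_t>(M) * N);
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      // Mainloop: dot product of row i of A with column j of B.
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        acc += A[i * K + k] * B[k * N + j];
      }
      // Epilogue: scale the accumulator and blend in the source matrix C.
      D[i * N + j] = alpha * acc + beta * C[i * N + j];
    }
  }
  return D;
}
```

A kernel's output can be compared element-wise against this loop on small problem sizes. The triple loop is deliberately simple. Production kernels split the same computation into tiles that are staged through shared memory and Tensor Cores, but the result should match up to floating-point rounding.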
