Just Released: CUTLASS 3.8

jwitsoe · January 31, 2025, 9:40pm

Originally published at: GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines

CUTLASS 3.8 extends support to the NVIDIA Blackwell SM100 architecture, reaching 99% of peak performance for Tensor Core operations. It brings essential features such as mixed-input GEMMs for efficient model quantization and grouped GEMM capabilities that accelerate mixture-of-experts (MoE) models through parallel expert computation.
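For readers new to the term, here is a rough sketch of what a grouped GEMM computes in the MoE case. It is plain Python reference code and not CUTLASS code. The function names (`matmul`, `grouped_gemm`, `moe_forward`) and the one-expert-per-token routing are assumptions made for illustration. The idea is that each expert gets its own matrix multiplication, the problem sizes can differ per expert, and a grouped GEMM kernel processes all of them in a single launch instead of one launch per expert.

```python
# Reference semantics only: plain Python, no GPU, no CUTLASS calls.
# All names here are hypothetical and chosen for illustration.

def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both as lists of rows."""
    k = len(b)
    n = len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def grouped_gemm(problems):
    """Compute every independent A_i @ B_i in the group.

    A grouped GEMM kernel would handle all of these problems in one launch;
    this loop only describes the result, not how a GPU schedules the work.
    """
    return [matmul(a, b) for a, b in problems]


def moe_forward(tokens, expert_ids, expert_weights):
    """Route each token to one expert and apply that expert's weight matrix."""
    # Collect the token indices assigned to each expert (group sizes may differ).
    groups = {e: [] for e in range(len(expert_weights))}
    for idx, e in enumerate(expert_ids):
        groups[e].append(idx)

    # Build one GEMM problem per expert that received at least one token.
    active = [e for e, idxs in groups.items() if idxs]
    problems = [([tokens[i] for i in groups[e]], expert_weights[e])
                for e in active]
    results = grouped_gemm(problems)

    # Scatter the expert outputs back into the original token order.
    out = [None] * len(tokens)
    for e, res in zip(active, results):
        for row, i in zip(res, groups[e]):
            out[i] = row
    return out
```

With three tokens `[[1, 2], [3, 4], [5, 6]]`, routing `[0, 1, 0]`, an identity matrix for expert 0 and a 2x scaling matrix for expert 1, `moe_forward` returns `[[1, 2], [6, 8], [5, 6]]`: expert 0 sees a 2x2 problem and expert 1 a 1x2 problem, which is the uneven shape mix that grouped GEMM is meant to cover.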
