Just Released: CUTLASS 3.8

Originally published at: GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines

CUTLASS 3.8 adds support for the NVIDIA Blackwell SM100 architecture, reaching 99% of peak performance for Tensor Core operations. It also brings essential features such as Mixed Input GEMMs for efficient model quantization and Grouped GEMM, which accelerates Mixture-of-Experts (MoE) models by computing the experts in parallel.
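
For readers new to the term, here is a minimal sketch of what a grouped GEMM computes: one independent matrix product per group, where each group can have a different problem size (for MoE, a different number of tokens routed to each expert). This is plain Python for illustration only and is not the CUTLASS API; the names `matmul`, `grouped_gemm`, `tokens_per_expert`, and `expert_weights` are hypothetical, and a real kernel would run the groups in parallel on the GPU rather than in a loop.

```python
# Reference semantics of a grouped GEMM (illustrative only, not CUTLASS code).

def matmul(a, b):
    """Multiply an M x K matrix by a K x N matrix, both given as lists of rows."""
    k = len(b)
    n = len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]

def grouped_gemm(a_groups, b_groups):
    """Compute one product per group; M, N, K may differ from group to group."""
    assert len(a_groups) == len(b_groups), "one weight matrix per group"
    return [matmul(a, b) for a, b in zip(a_groups, b_groups)]

# Two hypothetical experts: expert 0 receives 2 tokens, expert 1 receives 1 token.
tokens_per_expert = [
    [[1.0, 2.0], [3.0, 4.0]],  # 2 x 2 activations routed to expert 0
    [[5.0, 6.0]],              # 1 x 2 activations routed to expert 1
]
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: scale by 2
]

outputs = grouped_gemm(tokens_per_expert, expert_weights)
```

The point of the grouped form is that the per-expert products do not need to share a shape, so uneven token routing does not force padding every expert to the largest batch.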
