Originally published at: GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines
CUTLASS 3.8 adds support for the NVIDIA Blackwell SM100 architecture, reaching 99% of peak performance for Tensor Core operations. It also brings essential features such as mixed-input GEMMs for efficient model quantization and grouped GEMMs, which accelerate mixture-of-experts (MoE) models by computing the experts in parallel.
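To make the grouped GEMM idea concrete, here is a minimal CPU reference sketch of what such an operation computes: a set of independent matrix multiplications, one per expert, each with its own shape. This is not the CUTLASS API. The names `GemmProblem`, `reference_gemm`, and `reference_grouped_gemm` are hypothetical and exist only for illustration, and row-major float storage is an assumption.

```cpp
#include <cstddef>
#include <vector>

// One problem in the group: C (m x n) = A (m x k) * B (k x n), row-major.
// Hypothetical type for illustration; not part of CUTLASS.
struct GemmProblem {
    int m, n, k;
    std::vector<float> a;  // m * k elements
    std::vector<float> b;  // k * n elements
};

// Naive triple-loop reference for a single problem.
std::vector<float> reference_gemm(const GemmProblem& p) {
    std::vector<float> c(static_cast<std::size_t>(p.m) * p.n, 0.0f);
    for (int i = 0; i < p.m; ++i) {
        for (int j = 0; j < p.n; ++j) {
            float acc = 0.0f;
            for (int l = 0; l < p.k; ++l) {
                acc += p.a[i * p.k + l] * p.b[l * p.n + j];
            }
            c[i * p.n + j] = acc;
        }
    }
    return c;
}

// Grouped GEMM semantics: every problem is independent and may have a
// different shape (for MoE, a different number of tokens per expert).
// This reference walks the group sequentially; a GPU grouped kernel
// would instead schedule the problems in parallel.
std::vector<std::vector<float>> reference_grouped_gemm(
    const std::vector<GemmProblem>& group) {
    std::vector<std::vector<float>> out;
    out.reserve(group.size());
    for (const auto& p : group) {
        out.push_back(reference_gemm(p));
    }
    return out;
}
```

Because the problems share no data, they can run concurrently, which is why grouping suits MoE layers where each expert receives a different number of tokens. A reference like this is mainly useful for checking the output of a GPU kernel on small inputs.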