Improving GEMM Kernel Auto-Tuning Efficiency on NVIDIA GPUs with Heuristics and CUTLASS 4.2

jwitsoe · September 2, 2025, 5:00pm

Originally published at: Improving GEMM Kernel Auto-Tuning Efficiency on NVIDIA GPUs with Heuristics and CUTLASS 4.2 | NVIDIA Technical Blog

Selecting the best possible General Matrix Multiplication (GEMM) kernel for a specific problem and hardware is a significant challenge. The performance of a GEMM kernel is determined by an array of compile-time and runtime meta-parameters: CTA-, warp-, and instruction-level tile sizes, kernel schedules, rasterization strategies, cluster dimensions, split-K factors, and so on. The traditional…
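
To give a feel for how quickly the number of candidate kernels grows as these meta-parameters are combined, here is a minimal sketch in plain Python. Every candidate value below (tile shapes, cluster shapes, schedule names, raster orders, split-K factors) is a hypothetical placeholder chosen for illustration; none of it is taken from CUTLASS or from the blog post.

```python
from itertools import product

# Hypothetical candidate values for each meta-parameter (illustrative only;
# these are NOT the lists CUTLASS actually uses).
search_space = {
    "cta_tile": [(64, 64, 64), (128, 128, 64), (128, 256, 64), (256, 128, 64)],
    "cluster_shape": [(1, 1, 1), (2, 1, 1), (1, 2, 1), (2, 2, 1)],
    "kernel_schedule": ["schedule_a", "schedule_b", "schedule_c"],  # placeholder names
    "raster_order": ["along_m", "along_n"],                         # placeholder names
    "split_k": [1, 2, 4, 8],
}


def enumerate_configs(space):
    """Yield every combination of the candidate values as a dict."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(enumerate_configs(search_space))
print(len(configs))  # 4 * 4 * 3 * 2 * 4 = 384 candidate kernels
```

Even with only a handful of values per parameter, this toy space already holds 384 configurations, each of which would need to be compiled and benchmarked in a brute-force search.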

barrotdev · September 15, 2025, 9:00pm

Is there a way to specify a kernel search for implicit GEMM? This would be very valuable for CNNs whose channel counts are not multiples of 32 (or of a larger alignment), or for kernel shape and channel count combinations that cause a tail effect, a non-coalesced memory access pattern, a non-optimal stride … you get the point.
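
A rough sketch of the tail effect described above, in plain Python. It assumes the usual implicit GEMM mapping for a forward convolution (GEMM M = N·P·Q, GEMM N = K output channels, GEMM K = C·R·S), one CTA per SM, and a hypothetical 132-SM GPU. The layer shape and tile sizes are made up for illustration, and the model ignores occupancy, clusters, split-K, and memory effects.

```python
import math


def implicit_gemm_shape(n, p, q, k, c, r, s):
    """Map a forward convolution to GEMM dimensions (M, N, K)."""
    return n * p * q, k, c * r * s


def tile_efficiency(m, n, tile_m, tile_n):
    """Return (fraction of computed outputs that are useful, CTA tile count)."""
    tiles_m = math.ceil(m / tile_m)
    tiles_n = math.ceil(n / tile_n)
    computed = tiles_m * tile_m * tiles_n * tile_n
    return (m * n) / computed, tiles_m * tiles_n


def wave_efficiency(num_tiles, num_sms):
    """Fraction of SM slots kept busy, assuming one CTA per SM per wave."""
    waves = math.ceil(num_tiles / num_sms)
    return num_tiles / (waves * num_sms)


NUM_SMS = 132  # assumed SM count, for illustration only

# Hypothetical layer: batch 8, 28x28 output, 96 output channels,
# 48 input channels, 3x3 filter.
gemm_m, gemm_n, gemm_k = implicit_gemm_shape(8, 28, 28, 96, 48, 3, 3)

for tile_m, tile_n in [(128, 128), (128, 64), (128, 32)]:
    tile_eff, num_tiles = tile_efficiency(gemm_m, gemm_n, tile_m, tile_n)
    wave_eff = wave_efficiency(num_tiles, NUM_SMS)
    print(f"tile {tile_m}x{tile_n}: tiles={num_tiles}, "
          f"tile efficiency={tile_eff:.2f}, wave efficiency={wave_eff:.2f}")
```

With 96 output channels, the 128-wide and 64-wide tiles waste a quarter of their work on padding, while the 32-wide tile fits exactly but spills into a partly empty second wave. That kind of trade-off is what a per-layer kernel search would have to weigh.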
