Originally published at: Improving GEMM Kernel Auto-Tuning Efficiency on NVIDIA GPUs with Heuristics and CUTLASS 4.2 | NVIDIA Technical Blog
Selecting the best possible General Matrix Multiplication (GEMM) kernel for a specific problem and hardware is a significant challenge. The performance of a GEMM kernel is determined by an array of compile-time and runtime meta-parameters: CTA-, warp-, and instruction-level tile sizes, kernel schedules, rasterization strategies, cluster dimensions, split-k factors, and so on. The traditional…
Is there a way to specify an implicit GEMM kernel search? This would be very valuable for CNNs whose channel counts are not multiples of 32 (or higher), or for kernel shape and channel count combinations that cause a tail effect, a non-coalesced memory access pattern, a non-optimal stride … you get the point.
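
To make the concern concrete, here is a rough sketch of how a convolution maps onto an implicit GEMM and how much of the launched tile area does useful work. It assumes the usual forward-convolution mapping (M = N·P·Q, N = K output channels, K = C·R·S). The function names, the 128x64 CTA tile, and the 108-SM count are illustrative assumptions, not anything taken from CUTLASS or the blog post.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def implicit_gemm_shape(n, h, w, c, k, r, s, stride=1, pad=0):
    """Map a forward convolution (NHWC input, k filters of size r x s)
    to the (M, N, K) extents of the equivalent implicit GEMM."""
    p = (h + 2 * pad - r) // stride + 1  # output height
    q = (w + 2 * pad - s) // stride + 1  # output width
    return n * p * q, k, c * r * s

def tile_efficiency(m, n, tile_m, tile_n):
    """Fraction of the launched CTA tile area that covers real output.
    Values below 1.0 mean partially filled edge tiles."""
    padded_m = ceil_div(m, tile_m) * tile_m
    padded_n = ceil_div(n, tile_n) * tile_n
    return (m * n) / (padded_m * padded_n)

def wave_efficiency(m, n, tile_m, tile_n, num_sms):
    """Fraction of SM slots kept busy across all waves, assuming one CTA
    per SM at a time (a simplification). Low values indicate a tail effect."""
    tiles = ceil_div(m, tile_m) * ceil_div(n, tile_n)
    waves = ceil_div(tiles, num_sms)
    return tiles / (waves * num_sms)

# Hypothetical layer: 56x56 input, 24 input channels, 40 output channels, 3x3 filter
gemm_m, gemm_n, gemm_k = implicit_gemm_shape(1, 56, 56, 24, 40, 3, 3, pad=1)
print(gemm_m, gemm_n, gemm_k)                         # 3136 40 216
print(tile_efficiency(gemm_m, gemm_n, 128, 64))       # 0.6125
print(wave_efficiency(gemm_m, gemm_n, 128, 64, 108))
```

With 40 output channels and a 64-wide tile, more than a third of the tile area is padding, and 25 tiles on a hypothetical 108-SM part leave most SMs idle. These are the kinds of shapes where a per-problem kernel search would matter most.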