Pro Tip: cuBLAS Strided Batched Matrix Multiply

Originally published at: Pro Tip: cuBLAS Strided Batched Matrix Multiply | NVIDIA Developer Blog

There’s a new computational workhorse in town. For decades, general matrix-matrix multiply, known as GEMM in Basic Linear Algebra Subprograms (BLAS) libraries, has been a standard benchmark for computational performance. GEMM is possibly the most optimized and widely used routine in scientific computing. Expert implementations are available for every architecture and quickly achieve the peak performance of…