Apparently someone has built a GEMM implementation for CUDA that is significantly faster than cuBLAS (4x in some cases), targeting Maxwell and later GPUs.
Yes, Scott Gray. He has posted quite a bit here on these forums, mainly in the CUDA performance area.
If past history is an indication, NVIDIA’s CUBLAS team will already be aware of this, and if the source code is available under a BSD license, they may even incorporate it directly into CUBLAS (note the list of BSD licenses for various codes in the CUBLAS manual).
BLAS, and GEMM in particular, is notorious for researchers being ahead of vendor libraries with regard to specific variants of the functionality (e.g., specific sizes, matrix aspect ratios, matrix element types, transpose modes, architecture generations). This has a long tradition in the field, going back at least to the times when Kazushige Goto bested the BLAS libraries shipping with the DEC Alpha (around 1990, I think, but my memory is hazy).