SGEMM and SGEMV - large performance difference in cuBLAS

skps23 · April 7, 2024, 10:15pm

I am benchmarking my CUDA kernel implementations of SGEMM and SGEMV. As part of this, I also call the corresponding cuBLAS functions, cublasSgemm and cublasSgemv. My GPU is an RTX 3050 Mobile with a peak FP32 performance of 5.501 TFLOP/s (source).

I am using square N x N matrices for testing. The total number of floating-point operations for SGEMM is 2*(N^3 + N^2) (source: Lower Bounding the Fastest Possible Runtime). Likewise, the total number of floating-point operations for SGEMV is 2*(N^2 + N).

For SGEMM, the fraction of peak GPU performance achieved by the kernel can be computed as 2*(N^3 + N^2) [FLOP] / (KERNEL_TIME [s] * GPU_MAX_PERFORMANCE [FLOP/s]).
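The utilization formula above can be written out as a small helper. This is only a sketch: the function names are made up for illustration, and the kernel time in the usage line is a hypothetical placeholder, not a measurement from this thread.

```python
# Peak FP32 throughput quoted in the post for the RTX 3050 Mobile, in FLOP/s.
GPU_MAX_PERFORMANCE = 5.501e12


def sgemm_flops(n: int) -> int:
    """FLOP count for square SGEMM as used in the post: 2*(N^3 + N^2)."""
    return 2 * (n**3 + n**2)


def sgemv_flops(n: int) -> int:
    """FLOP count for square SGEMV as used in the post: 2*(N^2 + N)."""
    return 2 * (n**2 + n)


def utilization(flops: float, kernel_time_s: float,
                peak_flops_per_s: float = GPU_MAX_PERFORMANCE) -> float:
    """Fraction of peak throughput achieved: FLOP / (time [s] * peak [FLOP/s])."""
    return flops / (kernel_time_s * peak_flops_per_s)


if __name__ == "__main__":
    n = 4096
    hypothetical_time_s = 0.050  # placeholder value, NOT a measured kernel time
    frac = utilization(sgemm_flops(n), hypothetical_time_s)
    print(f"SGEMM N={n}: {100 * frac:.1f}% of FP32 peak")
```

The same `utilization` call works for SGEMV by passing `sgemv_flops(n)` and the measured SGEMV kernel time.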

TL;DR: With cuBLAS I get about 50% of peak FP32 throughput for SGEMM, but only about 1.5% for SGEMV. Is there a reason for such a large gap for SGEMV in cuBLAS?

njuffa · April 7, 2024, 11:14pm

SGEMM is a BLAS-3 (matrix-matrix) operation. It is very compute intensive and usually performance limited by compute throughput (there can be exceptions for extreme matrix aspect ratios).

SGEMV is a BLAS-2 (matrix-vector) operation. It is memory intensive and its performance is therefore limited by memory throughput.

These general performance characteristics are inherent in the operations themselves and are therefore shared by BLAS implementations across platforms. You will make analogous observations if you run BLAS on the host system, e.g. with the Intel MKL or oneAPI implementations.

It might be instructive to run with Nsight Compute (that is, a CUDA profiler) to verify the above statements. You might also want to familiarize yourself with the roofline model of performance.
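As a rough illustration of the roofline argument, the sketch below estimates the arithmetic intensity (FLOP per byte of memory traffic) of both operations and the resulting performance ceiling. It rests on several assumptions that are not from this thread: each operand crosses the memory bus exactly once (ideal reuse), the output is both read and written, and the memory bandwidth is taken as 192 GB/s, which is an assumed figure that should be replaced with the value for the specific card. The function names are hypothetical.

```python
PEAK_FLOPS = 5.501e12   # FP32 peak quoted in the thread, FLOP/s
ASSUMED_MEM_BW = 192e9  # ASSUMPTION: memory bandwidth in bytes/s; check your GPU
BYTES_PER_FLOAT = 4


def sgemv_intensity(n: int) -> float:
    """FLOP/byte for y = alpha*A*x + beta*y with a square N x N matrix."""
    flops = 2 * (n**2 + n)
    # A read once (N^2), x read (N), y read and written (2N).
    bytes_moved = BYTES_PER_FLOAT * (n**2 + 3 * n)
    return flops / bytes_moved


def sgemm_intensity(n: int) -> float:
    """FLOP/byte for C = alpha*A*B + beta*C, assuming ideal on-chip reuse."""
    flops = 2 * (n**3 + n**2)
    # A and B read once (2*N^2), C read and written (2*N^2).
    bytes_moved = BYTES_PER_FLOAT * 4 * n**2
    return flops / bytes_moved


def roofline_flops(intensity: float, peak: float = PEAK_FLOPS,
                   mem_bw: float = ASSUMED_MEM_BW) -> float:
    """Attainable FLOP/s: the lower of the compute roof and the memory roof."""
    return min(peak, intensity * mem_bw)


if __name__ == "__main__":
    n = 4096
    for name, fn in (("SGEMV", sgemv_intensity), ("SGEMM", sgemm_intensity)):
        ai = fn(n)
        bound = roofline_flops(ai)
        print(f"{name}: {ai:.2f} FLOP/byte, ceiling {100 * bound / PEAK_FLOPS:.1f}% of peak")
```

Under these assumptions SGEMV has an intensity of roughly 0.5 FLOP/byte, so its ceiling is about 0.5 × 192 GB/s ≈ 96 GFLOP/s, or under 2% of the FP32 peak. That is in the same range as the 1.5% reported in the question. For large N, SGEMM has an intensity of hundreds of FLOP/byte and hits the compute roof instead.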
