I am benchmarking my CUDA kernel implementations of SGEMM and SGEMV. As part of this, I also call the corresponding cuBLAS functions, cublasSgemm and cublasSgemv. My GPU is an RTX 3050 Mobile with a peak FP32 performance of 5.501 TFLOPS (source).

I am using square N x N matrices for testing. The total number of floating-point operations for SGEMM is 2*(N^3+N^2) (source: Lower Bounding the Fastest Possible Runtime). Likewise, the total number of floating-point operations for SGEMV is 2*(N^2+N).

For SGEMM, the fraction of peak GPU performance achieved by the kernel can be computed as 2*(N^3 + N^2) [FLOP] / (KERNEL_TIME [s] * GPU_MAX_PERFORMANCE [FLOP/s]).
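As a minimal sketch, the utilization formula above can be written as a small Python helper. The function names and the example kernel time are hypothetical and chosen only for illustration. The FLOP counts are the ones stated in this question, and the peak value is the 5.501 TFLOPS figure quoted above.

```python
PEAK_FLOP_PER_S = 5.501e12  # FP32 peak of the RTX 3050 Mobile, as quoted above


def sgemm_flop(n):
    """FLOP count for square SGEMM, using the formula from the question."""
    return 2 * (n**3 + n**2)


def sgemv_flop(n):
    """FLOP count for square SGEMV, using the formula from the question."""
    return 2 * (n**2 + n)


def utilization(flop, kernel_time_s, peak_flop_per_s=PEAK_FLOP_PER_S):
    """Fraction of peak throughput achieved by a kernel (1.0 == 100%)."""
    return flop / (kernel_time_s * peak_flop_per_s)


if __name__ == "__main__":
    n = 4096
    kernel_time_s = 0.050  # hypothetical measured time, NOT a real result
    frac = utilization(sgemm_flop(n), kernel_time_s)
    print(f"SGEMM N={n}: {100 * frac:.1f}% of peak")
```

The kernel time should come from a real measurement (for example CUDA events around the cuBLAS call), averaged over several runs after a warm-up call.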

TL;DR: I get about 50% GPU utilization for SGEMM, whereas only 1.5% is utilized for SGEMV in cuBLAS. Is there a reason for such a large difference for SGEMV in cuBLAS?

SGEMM is a BLAS-3 (matrix-matrix) operation. It is very compute intensive, and its performance is usually limited by compute throughput (there can be exceptions for extreme matrix aspect ratios).

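SGEMV is a BLAS-2 (matrix-vector) operation. Each matrix element is loaded from memory once and used for just one multiply-add, so there is no data reuse to exploit. The operation is memory intensive, and its performance is therefore limited by memory throughput rather than by FP32 compute throughput.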

These general performance characteristics are inherent in the operations themselves and are thus shared by BLAS implementations across platforms. You will make analogous observations if you run BLAS on the host system, e.g. with the Intel MKL or oneAPI implementations.

It might be instructive to run the kernels under Nsight Compute (the CUDA kernel profiler) to verify the above statements. You might also want to familiarize yourself with the roofline model of performance.
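A rough roofline estimate illustrates the point. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes a memory bandwidth of 192 GB/s for the RTX 3050 Mobile (an assumption; check the specification of your particular laptop variant), ideal data movement in which every operand is read or written exactly once, and the FLOP counts from the question. All function names are hypothetical.

```python
PEAK_FLOP_PER_S = 5.501e12  # FP32 peak quoted in the question
MEM_BW_BYTES_PER_S = 192e9  # ASSUMPTION: 192 GB/s, verify for your GPU variant
BYTES_PER_FLOAT = 4


def sgemv_intensity(n):
    """FLOP per byte for square SGEMV: read A (n*n), read x (n), write y (n)."""
    flop = 2 * (n**2 + n)
    bytes_moved = BYTES_PER_FLOAT * (n**2 + 2 * n)
    return flop / bytes_moved


def sgemm_intensity(n):
    """FLOP per byte for square SGEMM with ideal reuse: read A, B; write C."""
    flop = 2 * (n**3 + n**2)
    bytes_moved = BYTES_PER_FLOAT * 3 * n**2
    return flop / bytes_moved


def roofline(intensity, peak=PEAK_FLOP_PER_S, bw=MEM_BW_BYTES_PER_S):
    """Attainable FLOP/s: the lower of the compute roof and the memory roof."""
    return min(peak, intensity * bw)


if __name__ == "__main__":
    n = 4096
    for name, ai in (("SGEMV", sgemv_intensity(n)), ("SGEMM", sgemm_intensity(n))):
        bound = roofline(ai)
        print(f"{name}: {ai:.2f} FLOP/byte, "
              f"bound = {100 * bound / PEAK_FLOP_PER_S:.2f}% of FP32 peak")
```

Under these assumptions SGEMV has an arithmetic intensity of about 0.5 FLOP/byte, which caps it at roughly 96 GFLOP/s, or a little under 2% of the FP32 peak. That is in line with the 1.5% observed in the question. SGEMM, by contrast, has an intensity in the hundreds of FLOP/byte at this size, so its bound is the compute roof.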