GEMV library for NVIDIA GPU

jhkjhk · June 10, 2025, 5:10am

I’m currently researching GEMV kernels (for half precision, FP16) that achieve high performance through device-specific tuning. I’ve found the following libraries:

CUTLASS: cutlass/test/unit/gemm/device/gemv.cu in NVIDIA/cutlass (main branch)
FastGEMV: wangsiping97/FastGEMV on GitHub ("High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.")
Custom HGEMV: Bruce-Lee-LY/cuda_hgemv on GitHub ("Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.")

However, I haven’t been able to find a GEMV library specifically tuned for the A100 GPU.
So my questions are:

Are there any other GEMV libraries I might have missed?
If not, is using Nsight Profiler with various configurations the only way to tune GEMV performance on my device?

Curefab · June 10, 2025, 6:05am

The Ampere architecture introduced hardware support for structured (2:4) sparsity in the Tensor Cores and asynchronous copies from global to shared memory. Both can be useful for matrix multiplications. So profiling with Nsight and tuning launch configurations alone would not cover all of the A100-specific optimizations.

You could test those kernels against cuBLAS on your device.
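
When comparing timings, it may also help to see how close each kernel gets to the memory-bandwidth limit, since a dense GEMV reads every matrix element once and is typically memory-bound. Below is a minimal Python sketch of that lower bound, not a definitive model. It assumes A, x, and y are each moved through DRAM exactly once, uses 1555 GB/s as the peak (the published figure for the A100 40GB; the 80GB model is higher), and the function names are made up for illustration.

```python
# Roofline-style lower bound for an FP16 GEMV y = A @ x, with A of shape (m, n).
# Assumption: the kernel is purely memory-bound, reads A and x once, writes y once.

FP16_BYTES = 2

# Assumed peak DRAM bandwidth in GB/s (A100 40GB); adjust for your device.
A100_40GB_GBPS = 1555.0


def gemv_bytes(m, n, elem_bytes=FP16_BYTES):
    """Minimum bytes moved: matrix A (m*n), input x (n), output y (m)."""
    return elem_bytes * (m * n + n + m)


def min_time_us(m, n, peak_gbps=A100_40GB_GBPS):
    """Lower bound on runtime in microseconds at the given peak bandwidth."""
    return gemv_bytes(m, n) / (peak_gbps * 1e9) * 1e6


def achieved_fraction(m, n, measured_us, peak_gbps=A100_40GB_GBPS):
    """Fraction of peak bandwidth reached by a kernel that took measured_us."""
    return min_time_us(m, n, peak_gbps) / measured_us


if __name__ == "__main__":
    m = n = 4096
    print(f"bytes moved: {gemv_bytes(m, n)}")
    print(f"lower bound: {min_time_us(m, n):.2f} us")
```

Passing the measured time of each candidate kernel (and of cuBLAS) to `achieved_fraction` gives a rough idea of how much headroom is left. A kernel already near 1.0 is unlikely to gain much from further configuration tuning.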
