CUDA stand-alone version of dense matrix-vector multiplication


We have an OpenACC implementation of dense matrix-vector multiplication:

Y[M] = A[M,N] * X[N], where M >> N (M ~ 1000-10000, N ~ 10-1000)

Now we are trying to implement it in CUDA. We tested the CUDA implementation shown on Stack Overflow, but found that its performance is actually worse than the OpenACC implementation on an A100 GPU.

Perhaps that version is too old. Are there other CUDA implementations of dense matrix-vector multiplication available?

Thanks. /Jing

use the cublas library gemv function
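A minimal sketch of such a call might look like the following (the buffer names, sizes, and the row-major-layout handling are assumptions for illustration, not code from this thread). Note that cuBLAS assumes column-major storage, so a row-major A[M][N] is passed as its N-by-M column-major view with the transpose op:

```cpp
#include <cublas_v2.h>

// Sketch: y[M] = A[M,N] * x[N] in single precision via cuBLAS.
// d_A, d_x, d_y are assumed to be device pointers allocated elsewhere.
void gemv_cublas(cublasHandle_t handle,
                 const float* d_A, const float* d_x, float* d_y,
                 int M, int N) {
    const float alpha = 1.0f, beta = 0.0f;
    // Row-major A[M][N] is an N-by-M matrix in column-major terms, so
    // CUBLAS_OP_T yields y = A * x.
    cublasSgemv(handle, CUBLAS_OP_T,
                N, M,             // dimensions of the column-major view
                &alpha, d_A, N,   // lda = N for this layout
                d_x, 1,
                &beta, d_y, 1);
}
```

If A is already stored column-major, CUBLAS_OP_N with dimensions M, N and lda = M would be used instead.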

Hi Robert,

> use the cublas library gemv

Thanks for your suggestion.

We indeed obtained better performance using gemv compared with the stand-alone CUDA version. However, the performance of gemv seems very similar to that of the OpenACC version.

We have two test cases on A100 GPU,

1) row = 3000, col = 1538, iterations = 10000, single precision:
142.5 ms (OpenACC) vs 133.1 ms (cuBLAS)

2) row = 10000, col = 1538, iterations = 10000, single precision:
494 ms (OpenACC) vs 536 ms (cuBLAS)

Is the performance of cuBLAS reasonable?

Thanks /Jing

gemv is a BLAS2 (matrix-vector) operation, and those are typically limited by memory throughput. The consequence is that any implementation that pays attention to some basic principles of maximizing memory throughput will perform about the same. In light of that, your data points seem plausible.

Significant performance differences between implementations typically occur with BLAS3 (matrix-matrix) operations, which are mostly compute bound, especially as long as the matrices are square-ish. I would be surprised if a home-grown gemm implementation using OpenACC could perform roughly as well as the corresponding cuBLAS implementation.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.