CUDA stand-alone version of dense matrix-vector multiplication


We have an OpenACC implementation of dense matrix-vector multiplication:

Y[M] = A[M,N] * X[N], where M >> N (M ~ 1000-10000, N ~ 10-1000)

Now we are trying to implement it in CUDA. We tested the CUDA implementation shown on Stack Overflow, but found that its performance is actually worse than the OpenACC implementation on an A100 GPU.

Perhaps that version is too old. Are there other CUDA implementations of dense matrix-vector multiplication available?

Thanks. /Jing

use the cublas library gemv function
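A minimal sketch of such a call might look like the following (the buffer names, sizes, and the row-major-layout handling are assumptions for illustration, not code from this thread). Note that cuBLAS assumes column-major storage, so a row-major A[M][N] is passed as its N-by-M column-major view with the transpose op:

```cpp
#include <cublas_v2.h>

// Sketch: y[M] = A[M,N] * x[N] in single precision via cuBLAS.
// d_A, d_x, d_y are assumed to be device pointers allocated elsewhere.
void gemv_cublas(cublasHandle_t handle,
                 const float* d_A, const float* d_x, float* d_y,
                 int M, int N) {
    const float alpha = 1.0f, beta = 0.0f;
    // Row-major A[M][N] is an N-by-M matrix in column-major terms, so
    // CUBLAS_OP_T yields y = A * x.
    cublasSgemv(handle, CUBLAS_OP_T,
                N, M,             // dimensions of the column-major view
                &alpha, d_A, N,   // lda = N for this layout
                d_x, 1,
                &beta, d_y, 1);
}
```

If A is already stored column-major, CUBLAS_OP_N with dimensions M, N and lda = M would be used instead.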

Hi Robert,

> use the cublas library gemv

Thanks for your suggestion.

We indeed obtained better performance using gemv compared with the stand-alone CUDA version. However, the performance of gemv seems very similar to that of the OpenACC version.

We have two test cases on A100 GPU,

1) row = 3000, col = 1538, iterations = 10000, single precision:
142.5 ms (OpenACC) vs 133.1 ms (cuBLAS)

2) row = 10000, col = 1538, iterations = 10000, single precision:
494 ms (OpenACC) vs 536 ms (cuBLAS)

Is the performance of cuBLAS reasonable?

Thanks /Jing

gemv is a BLAS2 (matrix-vector) operation, and those are typically limited by memory throughput. The consequence is that any implementation that pays attention to some basic principles of maximizing memory throughput will perform about the same. In light of that, your data points seem plausible.

Significant performance differences between implementations typically occur with BLAS3 (matrix-matrix) operations, which are mostly compute bound, especially as long as the matrices are square-ish. I would be surprised if a home-grown gemm implementation using OpenACC could perform roughly as well as the corresponding cuBLAS implementation.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.