We have an OpenACC implementation of dense matrix-vector multiplication,
Y[M] = A[M,N] * X[N], where M >> N (M ~ 1000-10000, N ~ 10-1000).
We are now trying to implement it in CUDA. We tested the CUDA implementation shown on Stack Overflow,
but found that it actually performs worse than the OpenACC implementation on an A100 GPU.
Perhaps that version is too old. Are there other CUDA implementations of dense matrix-vector multiplication available?
Use the cuBLAS library gemv function.
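For single precision, the call in question is cublasSgemv. A hedged sketch of computing y = A*x this way (device pointer names d_A, d_x, d_y and the dimensions M, N are illustrative, and A is assumed already stored column-major as cuBLAS expects):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: y = 1.0 * A * x + 0.0 * y for an M x N column-major matrix A.
// d_A, d_x, d_y are device pointers; the handle comes from cublasCreate().
void sgemv(cublasHandle_t handle, int M, int N,
           const float *d_A, const float *d_x, float *d_y)
{
    const float alpha = 1.0f, beta = 0.0f;
    // CUBLAS_OP_N: no transpose; lda = M for column-major storage.
    // If A originates from row-major C/C++ code, one common option is
    // to pass CUBLAS_OP_T with the dimensions swapped rather than
    // physically transposing the matrix.
    cublasSgemv(handle, CUBLAS_OP_N, M, N,
                &alpha, d_A, M, d_x, 1, &beta, d_y, 1);
}
```

When benchmarking repeated calls, create the handle once outside the timing loop and time with CUDA events, so handle creation and host-side launch overhead are not attributed to gemv itself.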
Thanks for your suggestion.
We did obtain better performance using gemv compared with the stand-alone CUDA version. However, the performance of gemv seems to be very similar to the OpenACC version.
We have two test cases on A100 GPU,
1) row=3000, col=1538, iterations = 10000, single precision
142.5 ms (OpenACC) vs 133.1 ms (CUBLAS)
2) row=10000, col=1538, iterations = 10000, single precision
494 ms (OpenACC) vs 536 ms (CUBLAS)
Is the performance of cuBLAS reasonable?
gemv is a BLAS2 (matrix-vector) operation, and those are typically limited by memory throughput. The consequence is that any implementation that pays attention to some basic principles of maximizing memory throughput is going to perform about the same. In light of that, your data points seem plausible.
Significant performance differences between implementations typically occur with BLAS3 (matrix-matrix) operations, which are mostly compute bound, especially as long as the matrices are square-ish. I would be surprised if a home-grown gemm implementation using OpenACC could perform roughly as well as the corresponding cuBLAS implementation.
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.