CUDA stand-alone version of dense matrix-vector multiplication


We have an OpenACC implementation of dense matrix-vector multiplication:

Y[M] = A[M,N] * X[N], where M >> N (M ~ 1000-10000, N ~ 10-1000)

Now we are trying to implement it in CUDA. We tested the CUDA implementation shown on Stack Overflow, but found that its performance is actually worse than the OpenACC implementation on an A100 GPU.

Perhaps that version is too old. Are there other CUDA implementations of dense matrix-vector multiplication available?

Thanks. /Jing

use the cublas library gemv function
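A minimal sketch of such a call might look like the following (the buffer names, sizes, and the row-major-layout handling are assumptions for illustration, not code from this thread). Note that cuBLAS assumes column-major storage, so a row-major A[M][N] is passed as its N-by-M column-major view with the transpose op:

```cpp
#include <cublas_v2.h>

// Sketch: y[M] = A[M,N] * x[N] in single precision via cuBLAS.
// d_A, d_x, d_y are assumed to be device pointers allocated elsewhere.
void gemv_cublas(cublasHandle_t handle,
                 const float* d_A, const float* d_x, float* d_y,
                 int M, int N) {
    const float alpha = 1.0f, beta = 0.0f;
    // Row-major A[M][N] is an N-by-M matrix in column-major terms, so
    // CUBLAS_OP_T yields y = A * x.
    cublasSgemv(handle, CUBLAS_OP_T,
                N, M,             // dimensions of the column-major view
                &alpha, d_A, N,   // lda = N for this layout
                d_x, 1,
                &beta, d_y, 1);
}
```

If A is already stored column-major, CUBLAS_OP_N with dimensions M, N and lda = M would be used instead.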

Hi Robert,

> use the cublas library gemv

Thanks for your suggestion.

We indeed obtained better performance using gemv compared with the stand-alone CUDA version. However, the performance of gemv seems very similar to that of the OpenACC version.

We have two test cases on A100 GPU,

1) row = 3000, col = 1538, iterations = 10000, single precision:
142.5 ms (OpenACC) vs 133.1 ms (cuBLAS)

2) row = 10000, col = 1538, iterations = 10000, single precision:
494 ms (OpenACC) vs 536 ms (cuBLAS)

Is the performance of cuBLAS reasonable?

Thanks /Jing

gemv is a BLAS2 (matrix-vector) operation, and those are typically limited by memory throughput. The consequence is that any implementation that pays attention to some basic principles of maximizing memory throughput will perform about the same. In light of that, your data points seem plausible.

Significant performance differences between implementations typically occur with BLAS3 (matrix-matrix) operations, which are mostly compute bound, especially as long as the matrices are square-ish. I would be surprised if a home-grown gemm implementation using OpenACC could perform roughly as well as the corresponding cuBLAS implementation.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.