Can a higher speedup be obtained for the sparse matrix-vector product with OpenACC?

I programmed an OpenACC double-precision code for the sparse matrix-vector product and ran it on an RTX 2080 Ti. Compared to the sequential CPU code, the GPU code got a 21x speedup. My CPU has a base clock of 3.5 GHz (4.4 GHz boost) and the GPU has a base clock of 1350 MHz (1545 MHz boost). If the GPU's clock frequency were scaled up to match the CPU's, the GPU code should get about a 63x speedup, which is close to the number of SMs in the RTX 2080 Ti, i.e. 68 SMs. I tested my code on another GPU card, a GTX 1070, which gives similar results.

At first, I thought the speedup was limited by the number of double-precision units in the SMs of GeForce cards. I expected the speedup to increase significantly for a single-precision calculation, since each SM has as many as 64 CUDA cores, i.e. 64 single-precision units. But when I tried a single-precision code, I still got about a 21x speedup.

It seems that the speedup is highly dependent on the number of SMs on the graphics card. How can I utilize the many CUDA cores in each SM to obtain a further speedup?

Would the performance of a GPGPU, e.g. a Tesla P100/V100, be much better for the sparse matrix-vector product?



Hi Jingbo,

Not knowing your algorithm nor having done any analysis on SpMV algorithms in general, I can’t offer too much specific advice.

Though from what you noted, it sounds like the code is memory bound, or is hitting another bottleneck such as thread divergence. I'm not an expert in SpMV algorithms, but they often have many non-coalesced memory accesses, which can lead to the program becoming memory bound.

Have you tried profiling your code using pgprof or nvprof with metrics enabled? This may give you a better indication of where the performance limiter is.
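For example, with nvprof (pgprof accepts the same flags; metric names vary by CUDA toolkit version, so check `nvprof --query-metrics` on your system):

```shell
# Hotspot summary: which kernels dominate the runtime?
nvprof ./spmv

# Memory-side metrics for the SpMV kernel:
#   dram_utilization   - how close you are to peak DRAM bandwidth
#   gld_efficiency     - fraction of loaded bytes actually used (coalescing)
#   achieved_occupancy - how well the SMs are kept occupied
nvprof --metrics dram_utilization,gld_efficiency,achieved_occupancy ./spmv
```

High dram_utilization combined with low gld_efficiency would be the classic signature of a bandwidth-bound, poorly coalesced kernel.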



We have a sparse-matrix GPU PCG code and it is very memory bound. That is why you are getting a "good" speedup on a GeForce card even though it has 32x fewer double-precision cores than its equivalent Tesla card.

If it helps, here is a timing chart for our code on multiple GPUs running double precision (the CPU results here use a faster algorithm, so you should not compare CPU to GPU - just GPU to GPU). One caveat is that these are wall-clock times including IO, some array operations, and global MPI collective operations. However, most of the time is spent in the sparse matrix-vector product, so it might be of help.

Also, results will be highly dependent on your sparse format. Our code uses diagonal storage (DIA) while most use CSR. From previous NVIDIA presentations I have seen, there was evidence that for many matrices ELL is best on GPUs. Basically, the more stride-1 array accesses you can get out of your format for your matrix, the better.

  • Ron