I wrote an OpenACC double-precision sparse matrix-vector product code and ran it on an RTX 2080 Ti. Compared to the sequential CPU code, the GPU code achieved a 21x speedup. My CPU has a base clock of 3.5 GHz (4.4 GHz boost) and the GPU has a base clock of 1350 MHz (1545 MHz boost). If the GPU ran at the same clock frequency as the CPU, the speedup would scale to about 63x, which is close to the number of SMs in the RTX 2080 Ti, i.e. 68. I tested my code on another GPU, a GTX 1070, and got similar results.
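For reference, here is a minimal sketch of what such a kernel might look like; the original code is not shown, so the CSR storage format and the array names `row_ptr`, `col_idx`, and `val` are assumptions:

```c
#include <stdlib.h>

/* Minimal sketch of a CSR sparse matrix-vector product y = A*x with OpenACC.
 * Assumes the arrays are already present on the device (e.g. via an
 * enclosing "#pragma acc data" region). */
void spmv_csr(int n_rows, const int *restrict row_ptr,
              const int *restrict col_idx, const double *restrict val,
              const double *restrict x, double *restrict y)
{
    #pragma acc parallel loop gang \
        present(row_ptr, col_idx, val, x, y)
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        /* Inner reduction over the nonzeros of row i; mapping it to
         * vector lanes lets the CUDA cores within an SM cooperate
         * on a single row. */
        #pragma acc loop vector reduction(+:sum)
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += val[j] * x[col_idx[j]];
        y[i] = sum;
    }
}
```

Mapping the outer loop to gangs and the inner loop to vector lanes is one common schedule; how the compiler maps these levels onto SMs and CUDA cores directly affects the utilization discussed below.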
At first I thought the speedup was limited by the small number of double-precision units in the SMs of GeForce cards. I expected the speedup to increase significantly for single-precision computation, since each SM has as many as 64 CUDA cores, i.e. 64 single-precision units. But when I tried a single-precision version of the code, I still got only about a 21x speedup.
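The single-precision test was the same kernel with the value type switched; a hypothetical way to express that switch on the sketch above:

```c
/* Hypothetical precision switch: compiling with -DUSE_FLOAT selects the
 * single-precision variant of the same SpMV kernel. */
#ifdef USE_FLOAT
typedef float  real_t;   /* uses the 64 FP32 CUDA cores per SM */
#else
typedef double real_t;   /* uses the few FP64 units per SM on GeForce */
#endif
```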
It seems that the speedup is mainly determined by the number of SMs on the graphics card. How can I make use of the many CUDA cores in each SM to obtain a further speedup?
Would a compute-oriented GPGPU such as a Tesla P100/V100 perform much better on the sparse matrix-vector product?