I’ve been reading up on some of the work on computing sparse matrix-vector products lately, but I’m not sure how these numbers compare to the dense case.

I played with it a little using the latest CUBLAS library, and I am getting around 50 GFLOPS on a C2050 for the dense case in single precision. Does this seem about right? It seems a little low to me…

Do you mean that with CUDA 3.1, sgemm only reaches 50 Gflop/s on a C2050? That should be impossible. How do you compute Gflop/s, and how do you measure the elapsed time of the kernel?

Formally speaking, we would like to use Gflop/s even for SpMV, although we know SpMV is a memory-bound problem.

The reason, I think, is that the number of FMAs is independent of the algorithm, the hardware, and so on. It is universal, so we use the number of FMAs to measure the performance of both GEMM and SpMV.
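To make the FMA-counting convention concrete, here is a minimal sketch. The function names are made up for illustration, and it assumes the usual convention that one FMA counts as 2 floating-point operations.

```python
def gemm_flops(m, n, k):
    """C (m x n) = A (m x k) * B (k x n): m*n*k FMAs, 2 flops each."""
    return 2 * m * n * k


def gemv_flops(m, n):
    """y (m) = A (m x n) * x (n): one FMA per matrix element."""
    return 2 * m * n


def spmv_flops(nnz):
    """Sparse y = A * x: one FMA per stored nonzero, whatever the storage format."""
    return 2 * nnz


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the calls.
    print(gemv_flops(4096, 4096))
    print(spmv_flops(1_000_000))
```

The count depends only on the problem size (or on the number of nonzeros in the sparse case), which is what makes Gflop/s comparable across implementations and hardware.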

No, I mean sgemv, not sgemm. For sgemm I get a much higher FLOP rate. The way I calculated it was by measuring the total time to compute 1000 calls to sgemv, each time using a matrix and vector that are already in GPU memory. The matrix is of size m by n and the vector of size n, so to calculate the GFLOPS I computed m*n*1000*2/(time_elapsed). The 1000 is because I'm making 1000 calls, the 2 is because I need to compute a multiply and add for each element.
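For reference, the calculation described above could be written out roughly as follows. This is only a sketch: the function names are made up, the time is assumed to be in seconds, and the 144 GB/s figure is the commonly quoted peak memory bandwidth for the C2050, which is an assumption that does not come from this thread. The second function gives a rough upper bound for a memory-bound sgemv, assuming each matrix element is read from memory once and ignoring the vector traffic.

```python
def gemv_gflops(m, n, calls, time_elapsed_s):
    """Achieved rate: 2 flops (multiply + add) per matrix element per call."""
    return 2.0 * m * n * calls / time_elapsed_s / 1e9


def gemv_bandwidth_bound_gflops(peak_bandwidth_gb_s, bytes_per_element=4):
    """Rough ceiling when every matrix element is read once from memory.

    Each element costs bytes_per_element bytes of traffic and yields 2 flops,
    so the bound is 2 * bandwidth / bytes_per_element.
    """
    return 2.0 * peak_bandwidth_gb_s / bytes_per_element


if __name__ == "__main__":
    # Assumed peak bandwidth of 144 GB/s, single precision (4 bytes per element).
    print(gemv_bandwidth_bound_gflops(144.0))  # 72.0
```

Under those assumptions the ceiling for single-precision sgemv is about 72 GFLOPS, so a measured 50 GFLOPS would be a plausible fraction of what the memory system allows rather than an obviously low number.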
