I am benchmarking the CUBLAS kernels on the Tesla C870 (all rates exclude host-device I/O):

- Sgemm: about 120 GFLOPS.
- Sgemv: about 30 GFLOPS, but only for matrix sizes of about 8000.
- Ssymm: slower than Sgemm; I can only get about 80 GFLOPS, and there is a significant performance drop for matrices larger than 2000, where I only get about 70 GFLOPS.
- Ssymv: the poorest performer of them all; I get only a few GFLOPS even for matrix sizes on the order of 2000.
I count the floating-point operations for matrix-matrix and matrix-vector multiplication of n x n matrices as 2n^3 and 2n^2 respectively, and divide by the measured kernel time.
The only numbers that look good to me are those for Sgemv; the rest are much lower than I expected them to be. Am I doing something terribly wrong? Has anybody else got similar numbers?
I’d appreciate some feedback on this.