cublas sgemm benchmarks

Hi,

I’m trying to optimize a program where sgemm and strsv calls are the bottleneck, by fiddling with the block sizes. I’m particularly interested in:

  • comparison between cublas 1.1 and cublas 2.0 versions
  • performance for different (also non-square) matrix sizes
  • comparison of different devices

I’d appreciate it if you could tell me where to find some substantiated benchmarks, or under which conditions sgemm performs optimally.

Regards,
M

I believe sgemm in 2.0 runs best in the AB and AB^T cases when the height of A is a multiple of 64 and the other dimensions are multiples of 16. Check out the sgemm source code posted at http://forums.nvidia.com/index.php?showtop…14&#entry314014; it includes timing.
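
If your block sizes are free to choose, one way to act on the multiples above is to round each dimension up before calling sgemm. Below is a rough host-side sketch in plain C, not taken from the linked source: round_up, gemm_dims, pad_for_sgemm and sgemm_gflops are hypothetical helper names, and the 64/16 figures are just the ones quoted above, so treat them as something to verify on your own device. It assumes C = A * B with A being m x k and B being k x n.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of `multiple` (which must be > 0). */
static size_t round_up(size_t n, size_t multiple)
{
    assert(multiple > 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

/* Hypothetical container for the dimensions of C = A * B,
   where A is m x k and B is k x n. */
typedef struct {
    size_t m, n, k;
} gemm_dims;

/* Pad the dimensions to the multiples suggested in this thread:
   height of A (m) to a multiple of 64, the other two to multiples of 16.
   These constants are an assumption carried over from the post above. */
static gemm_dims pad_for_sgemm(size_t m, size_t n, size_t k)
{
    gemm_dims d;
    d.m = round_up(m, 64);
    d.n = round_up(n, 16);
    d.k = round_up(k, 16);
    return d;
}

/* Convert a measured run time into GFLOP/s, counting 2*m*n*k
   floating-point operations for one m x n x k matrix multiply. */
static double sgemm_gflops(size_t m, size_t n, size_t k, double seconds)
{
    return 2.0 * (double)m * (double)n * (double)k / seconds / 1e9;
}
```

For example, pad_for_sgemm(1000, 1000, 1000) gives m = 1024, n = 1008, k = 1008. The padded rows and columns have to be zero-filled so that the original part of the result is unchanged. Padding also adds work, so it is worth comparing GFLOP/s computed from the original (unpadded) sizes for both variants before settling on it.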