Performance query: odd results profiling GPU speed of matrix multiplication using cublas

I’ve just been profiling cublas multiplying two matrices of random floats of increasing dimension, and got some curious results.
See attached for graph of GPU performance.
I’m curious about the step-cycle observed on both a GTX280 and a Tesla card.
I was wondering if others have seen this? Do you have any suggestions as to why?
Regards, David
speed_upGPUvsCPU.png

There are two different versions of sgemm() in cublas, one considerably faster than the other. The faster one is used only when the matrix dimensions are nice round multiples of its execution parameters; the slower version is used otherwise. If you profile your benchmark application, you can see the different kernels in the output. I don’t remember what I measured the difference in performance for single precision to be, but for double precision it is something like a 15% difference between the two.