How to time CUBLAS functions? cublasSgemv vs. nested loops


I have a simple block of code that does a matrix-vector multiplication in two ways: with the cublasSgemv function, and with a function I wrote that uses two nested for loops to do the same calculation. The problem is that when I time the two operations, the nested for loops come out faster according to my timing method. This doesn't seem right to me. I am using CUDA events to time both operations. Is there something else I should be doing to get a more accurate reading of the execution time?

Thanks in Advance!!!

Did you warm up the device? The first kernel launch typically takes longer to complete and should not be included in the timing.
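As a rough illustration of the warm-up advice, here is a minimal sketch of timing cublasSgemv with CUDA events. It assumes the cuBLAS v2 API (cublas_v2.h), which may differ from the API used in the original code; the matrix size and the repeat count are made-up values, and the small check helpers are hypothetical names, not part of any library. One untimed call is issued first, then the average over several calls is reported.

```cpp
// Sketch only: times cublasSgemv with CUDA events, excluding a warm-up call.
// Assumed build line: nvcc sgemv_timing.cpp -lcublas
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helpers: abort on error so a failed call is never timed.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}
static void check(cublasStatus_t st, const char* what) {
    if (st != CUBLAS_STATUS_SUCCESS) {
        std::fprintf(stderr, "%s: cuBLAS status %d\n", what, static_cast<int>(st));
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    const int m = 4096, n = 4096;  // assumed sizes (multiples of 32)
    const int iters = 100;         // assumed number of timed repetitions
    const float alpha = 1.0f, beta = 0.0f;

    // Column-major m x n matrix and input vector, filled with ones.
    std::vector<float> hA(static_cast<size_t>(m) * n, 1.0f), hx(n, 1.0f);
    float *dA = nullptr, *dx = nullptr, *dy = nullptr;
    check(cudaMalloc((void**)&dA, hA.size() * sizeof(float)), "cudaMalloc A");
    check(cudaMalloc((void**)&dx, n * sizeof(float)), "cudaMalloc x");
    check(cudaMalloc((void**)&dy, m * sizeof(float)), "cudaMalloc y");
    check(cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float),
                     cudaMemcpyHostToDevice), "copy A");
    check(cudaMemcpy(dx, hx.data(), n * sizeof(float),
                     cudaMemcpyHostToDevice), "copy x");

    cublasHandle_t handle;
    check(cublasCreate(&handle), "cublasCreate");

    // Warm-up: the first call pays one-time initialization costs, so run it
    // once and wait for it to finish before starting the timer.
    check(cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1,
                      &beta, dy, 1), "warm-up sgemv");
    check(cudaDeviceSynchronize(), "sync after warm-up");

    cudaEvent_t start, stop;
    check(cudaEventCreate(&start), "create start event");
    check(cudaEventCreate(&stop), "create stop event");

    // Both events go on the default stream, which the handle also uses here.
    check(cudaEventRecord(start, 0), "record start");
    for (int i = 0; i < iters; ++i) {
        check(cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1,
                          &beta, dy, 1), "timed sgemv");
    }
    check(cudaEventRecord(stop, 0), "record stop");
    check(cudaEventSynchronize(stop), "wait for stop event");

    float ms = 0.0f;
    check(cudaEventElapsedTime(&ms, start, stop), "elapsed time");
    std::printf("average cublasSgemv time: %f ms\n", ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

Averaging over many calls also helps because a single SGEMV on a small matrix can finish in less time than the event timer can measure reliably. Note that this times only the GPU computation; host-to-device copies are deliberately left outside the timed region.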


No, I haven't done that… I will give it a try and see if it helps. Also, are there optimal sizes for the matrix and vector that yield the best results from the GPU?

I guess that depends on whether your matrices are stored in row-major or column-major format, but I believe using multiples of 32 (warp size) in both dimensions is optimal.


For non-trivially sized problems, I have found CUBLAS SGEMV is a lot faster than the best single-threaded host CPU SGEMV I have access to (I usually use GotoBLAS). And even multiples of 32 are a lot faster than odd sizes, so it does pay to pad storage out to multiples of 32.
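One way the padding idea could look on the host side is sketched below, assuming column-major storage (the layout cuBLAS expects) and padding only the leading dimension. The function names round_up_32, pad_rows and sgemv_ref are invented for this illustration. The nested-loop reference takes the leading dimension as a parameter, so it gives the same result on padded and unpadded storage.

```cpp
#include <cstddef>
#include <vector>

// Round n up to the next multiple of 32 (the warp size).
inline int round_up_32(int n) { return (n + 31) / 32 * 32; }

// Copy an m x n column-major matrix (leading dimension m) into storage whose
// leading dimension is padded to a multiple of 32. Padding entries are zero.
inline std::vector<float> pad_rows(int m, int n, const std::vector<float>& A,
                                   int& lda_out) {
    lda_out = round_up_32(m);
    std::vector<float> padded(static_cast<std::size_t>(lda_out) * n, 0.0f);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            padded[static_cast<std::size_t>(j) * lda_out + i] =
                A[static_cast<std::size_t>(j) * m + i];
    return padded;
}

// Reference y = A * x with two nested loops, column-major, lda >= m.
inline std::vector<float> sgemv_ref(int m, int n, const std::vector<float>& A,
                                    int lda, const std::vector<float>& x) {
    std::vector<float> y(m, 0.0f);
    for (int j = 0; j < n; ++j)        // walk columns in the outer loop
        for (int i = 0; i < m; ++i)    // contiguous access within a column
            y[i] += A[static_cast<std::size_t>(j) * lda + i] * x[j];
    return y;
}
```

The padded buffer and lda_out would then be passed to cublasSgemv as the matrix and its lda argument, with m and n left at their logical values. Padding the column count as well would need matching zero entries appended to x.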