I have a simple block of code that does a matrix-vector multiplication using the cublasSgemv function, and a function I wrote that does the same calculation with two nested for loops. The problem is that when I time the two operations, the nested for loops execute faster according to my timing method. This doesn’t seem right to me. I am using CUDA events to time both operations. Is there something else that I should be doing to get a more accurate reading of the execution time?
No, I haven’t done that… I will give it a try and see if it helps. Also, are there optimal sizes for the matrix and vector that get the best performance out of the GPU?
I guess that depends on whether your matrices are stored in row-major or column-major format, but I believe using multiples of 32 (warp size) in both dimensions is optimal.
For non-trivially sized problems, I have found CUBLAS SGEMV is a lot faster than the best single-threaded host CPU SGEMV I have access to (I usually used GotoBLAS). And sizes that are exact multiples of 32 are a lot faster than sizes that are not, so it does pay to pad storage out to multiples of 32.
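To make the padding suggestion concrete, here is a minimal host-side sketch, assuming column-major storage (the layout CUBLAS expects). The helper names round_up and pad_column_major are hypothetical, not part of CUBLAS, and this is only one way to pad, not necessarily what the posters above did.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: round n up to the next multiple of `multiple`
// (32 here, the warp size mentioned above).
constexpr std::size_t round_up(std::size_t n, std::size_t multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

// Hypothetical helper: copy a rows x cols column-major matrix into
// zero-filled storage whose leading dimension and column count are both
// padded to multiples of 32. Element (i, j) lives at index j * lda + i.
std::vector<float> pad_column_major(const std::vector<float>& a,
                                    std::size_t rows, std::size_t cols,
                                    std::size_t& lda_out,
                                    std::size_t& cols_out) {
    assert(a.size() == rows * cols);  // input is tightly packed
    const std::size_t lda = round_up(rows, 32);
    const std::size_t padded_cols = round_up(cols, 32);
    std::vector<float> padded(lda * padded_cols, 0.0f);
    for (std::size_t j = 0; j < cols; ++j) {
        for (std::size_t i = 0; i < rows; ++i) {
            padded[j * lda + i] = a[j * rows + i];
        }
    }
    lda_out = lda;
    cols_out = padded_cols;
    return padded;
}
```

For a 1000 x 500 matrix this gives a leading dimension of 1024 and 512 padded columns. The padded leading dimension is what would be passed as the lda argument of cublasSgemv; whether to also pass the padded sizes as m and n (with the extra vector entries zero-filled) or keep the original m and n is a choice worth timing on your own hardware.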