I have a simple block of code that does a matrix-vector multiplication using the cublasSgemv function, and a function I wrote that does the same calculation with two nested for loops. The problem is that when I time the two operations, the nested for loops execute faster according to my timing method. This doesn't seem right to me. I am using CUDA events to time both operations. Is there something else that I should be doing to get a more accurate reading of the execution time?
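For illustration, event-based timing of a cublasSgemv call is often set up along the lines below. This is only a sketch under assumptions, not the code from this post: the matrix size (1024 x 1024), the all-ones test data, the warm-up call, the repetition count, and the use of the handle-based cuBLAS v2 API (cublas_v2.h) are all choices made for the example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    // Assumed problem size and data; the post does not state them.
    const int m = 1024, n = 1024;
    std::vector<float> hA(static_cast<size_t>(m) * n, 1.0f), hx(n, 1.0f), hy(m, 0.0f);

    float *dA = nullptr, *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dA, sizeof(float) * m * n);
    cudaMalloc(&dx, sizeof(float) * n);
    cudaMalloc(&dy, sizeof(float) * m);

    // Transfers are done before the timed region so they are not counted.
    cudaMemcpy(dA, hA.data(), sizeof(float) * m * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), sizeof(float) * n, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call: the first call can include one-time initialization cost.
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1, &beta, dy, 1);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time several repetitions and report the average per call.
    const int reps = 100;
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i) {
        cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1, &beta, dy, 1);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the stop event has actually occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average cublasSgemv time: %f ms\n", ms / reps);

    // Sanity check: with all-ones inputs every output element should equal n.
    cudaMemcpy(hy.data(), dy, sizeof(float) * m, cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expected %d)\n", hy[0], n);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

If the nested loops still come out ahead with a setup like this, the usual things to check are whether the timed region includes host-to-device copies or the first-call initialization, and whether the matrix is small enough that launch overhead dominates the actual work.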
Thanks in advance!