I have a kernel launch and its associated memory transfers between a cutStartTimer / cutStopTimer pair, using cutGetTimerValue to obtain the time spent running the algorithm in ms. When I scale up the size of the problem, I consistently get back around 124 to 141 ms; the reported time is not proportional to the problem size. On the other hand, the wall-clock time does increase linearly (as far as I can tell!) with the size of the problem.
Do I need to compile with dbg=1 to get proper timing? Is there anything I need to do to use the timers besides cutCreateTimer?
Add a cudaThreadSynchronize() after the kernel call if you haven’t done that already; kernel launches are asynchronous, so without it the CPU timer stops before the kernel has actually finished.
Ideally, use streams and events for timing; they will be much more accurate for short kernels than CPU-based timers.
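For reference, the event-based pattern looks roughly like this (a sketch only; `myKernel`, `d_data`, `grid`, `block`, and `n` are placeholders for your own code, and the cudaEventSynchronize is what makes the measurement honest, since the launch returns immediately):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);            // queued in the same stream as the kernel
myKernel<<<grid, block>>>(d_data, n); // asynchronous launch
cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);           // block until the kernel has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

This measures time on the GPU itself, so it is immune to the launch-is-asynchronous pitfall above.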
Thanks for the tip, cudaThreadSynchronize() did the job.
Now I’m wondering how many rookies like myself got lured into thinking their code was running that much faster on the GPU vs. a CPU implementation. Wouldn’t it be safer if cutStopTimer called for synchronization before taking the time?
I’ll look up streams and events!
If it makes you happier: I was - that is kind of embarrassing actually ;)
Unfortunately this did not solve the problem for me. I am running the test within a loop that varies the sizes of the arrays I am doing simple math on. The first couple of tests give realistic FLOPS values. Once the arrays surpass 10 million elements, the FLOPS values are way off. The test is obviously taking much longer, as I am waiting longer for the output, but the times returned are much lower than for the faster earlier tests. Is there a limit on the timer where it wraps around to 0? That could explain it. Otherwise I’m pretty confused.
I just now tried using events to record the kernel time, as in the simpleStreams example, and am getting the unrealistic FLOPS values once again on the longer arrays. I am clueless at this point; any help would be greatly appreciated.
I think I discovered my problem. The grid size I was submitting to the kernel was too large for the tests that gave me unrealistic times, so the kernel presumably never launched. Odd though that I still had to wait longer for it to do something; perhaps the memcpy was what I was waiting for. If I did want to process a large 1D array that would exceed the max grid size, is it best to simply split it up, or is there a better alternative using one memcpy but multiple kernel invocations?
You could just launch a 2D grid and treat it as a big 1D one.
The same way you would access a 2D array stored in a flat 1D buffer with array[y*width+x].
Have your thread index be something like

const int idx = (blockIdx.y*blockDim.x*gridDim.x) + blockIdx.x*blockDim.x + threadIdx.x;