kernel in loop (time explodes)


i’m calling my kernel within a loop, but the time i’m measuring seems to explode with the length of the loop.

for(int i=0; i<1; i++) {
    // 1. copy some memory to the device
    // 2. call the kernel
}

this takes 0.26 ms. if i now let the loop run 360 times ( for(int i=0; i<360; i++) ), i measure 5398.12 ms.

why isn’t it just 0.26 * 360 = 93.6 ms (is there some invisible thread synchronisation)??

regards rob

currently i’m trying some loop-unrolling … is there also something else that helps? maybe i missed a part in the cuda manuals

Did you synchronize before stopping the timer?
Kernel launches are asynchronous and return control to the host immediately. If you don’t perform an action after the launch that requires
the kernel’s result to be available (such as a DeviceToHost copy), then the time the kernel actually takes to execute is not included
in your timing.


Use cudaThreadSynchronize() before you stop the timer in your first case, where the kernel is called just once.
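
A minimal sketch of what that looks like, assuming a host-side wall-clock timer. The kernel busyKernel, the helper timeLoopMs and the buffer size are made up for illustration and are not the original poster's code. It calls cudaDeviceSynchronize(), which replaces the deprecated cudaThreadSynchronize() in newer toolkits.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's kernel: some arithmetic per element.
__global__ void busyKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = data[idx];
        for (int k = 0; k < 1000; ++k)
            v = v * 1.0001f + 0.5f;
        data[idx] = v;
    }
}

// Runs "copy to device, launch kernel" `iterations` times and returns the
// elapsed host time in milliseconds (or -1.0 if the allocation fails).
// With syncBeforeStop == false the timer can stop while the last kernel
// is still queued or running.
double timeLoopMs(int iterations, bool syncBeforeStop)
{
    const int n = 1 << 20;                 // assumed problem size
    const size_t bytes = n * sizeof(float);
    float *hostData = new float[n]();      // zero-initialised host buffer
    float *devData = nullptr;
    if (cudaMalloc(&devData, bytes) != cudaSuccess) {
        delete[] hostData;
        return -1.0;
    }
    cudaDeviceSynchronize();               // start timing from an idle device

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice);
        busyKernel<<<(n + 255) / 256, 256>>>(devData, n);  // returns immediately
    }
    if (syncBeforeStop)
        cudaDeviceSynchronize();           // wait until all launched work is done
    auto stop = std::chrono::steady_clock::now();

    cudaDeviceSynchronize();               // drain any pending work before cleanup
    cudaFree(devData);
    delete[] hostData;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main()
{
    printf("1 iteration, no sync:      %.3f ms\n", timeLoopMs(1, false));
    printf("1 iteration, with sync:    %.3f ms\n", timeLoopMs(1, true));
    printf("360 iterations, with sync: %.3f ms\n", timeLoopMs(360, true));
    return 0;
}
```

With the synchronize call in place, the one-iteration and 360-iteration numbers should be comparable per iteration. Note also that the blocking cudaMemcpy in each iteration waits for the previous kernel in the same (default) stream, so in a loop like this only the last launch can escape an unsynchronized timer.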

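If you fill up the launch queue, the driver will synchronize on you so it can keep queuing things instead of returning some sort of launch failure. So with 360 launches the host probably ends up waiting for kernels to actually finish, and that measurement includes real execution time, while the single-launch figure mostly reflects launch overhead.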