Kernels and For Loops

I have a question that I am hoping someone might be able to answer:

I am using a for loop to execute a kernel multiple times. For the first 16 iterations, the processing time is consistent with the expected average. However, after the 16th iteration the measured time jumps significantly, e.g. from 20 microseconds to 5 milliseconds. This continues to occur even when I use multiple different kernels: after the 16th kernel call, the processing time skyrockets!

Can someone explain why this is happening? Your time is greatly appreciated!

Kernel launches are asynchronous, so each call returns to the host right away: the time you measure for the first 16 calls is essentially just the launch overhead, not the kernel's execution time. The launch queue is 16 entries deep, so once it fills there is an implicit synchronization with the device (waiting for earlier kernels to finish) before the next asynchronous launch can be queued.
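A minimal sketch of what that looks like, assuming a trivial placeholder kernel (dummyKernel here is hypothetical, not from the original post): timing each launch on the host without synchronizing measures only the cost of queuing the launch, until the queue fills and a launch finally blocks.

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical kernel used only to illustrate launch queuing.
__global__ void dummyKernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = out[i] * 2.0f;
}

int main()
{
    const int N = 256;
    float *d_out = nullptr;
    cudaMalloc(&d_out, N * sizeof(float));

    // Time each launch on the host WITHOUT synchronizing: the measured time is
    // only the launch overhead, until the driver's queue fills and the next
    // launch has to wait for earlier kernels to drain.
    for (int iter = 0; iter < 32; ++iter) {
        auto t0 = std::chrono::high_resolution_clock::now();
        dummyKernel<<<1, N>>>(d_out);
        auto t1 = std::chrono::high_resolution_clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        printf("launch %2d: %8.1f us (host-side launch time only)\n", iter, us);
    }

    cudaDeviceSynchronize();   // wait for all queued kernels before exiting
    cudaFree(d_out);
    return 0;
}
```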

You can synchronize with the device yourself for timing purposes by calling cudaThreadSynchronize (cudaDeviceSynchronize in newer CUDA releases) or by using the events API.
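As a sketch of the events approach (again using a hypothetical dummyKernel), recording an event before and after the launch and synchronizing on the second one gives the actual device execution time rather than the launch overhead:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, stand-in for whatever you are timing.
__global__ void dummyKernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = out[i] * 2.0f;
}

int main()
{
    const int N = 256;
    float *d_out = nullptr;
    cudaMalloc(&d_out, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events around the launch; the elapsed time between them is the
    // device-side execution time, independent of the asynchronous launch.
    cudaEventRecord(start, 0);
    dummyKernel<<<1, N>>>(d_out);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);          // block until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

Alternatively, calling cudaDeviceSynchronize before reading a host-side timer gives a similar result, at the cost of stalling the CPU until the device is idle.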

Thank you very much, I really appreciate it. I originally had a thread-sync call but removed it; I will need to go back and put it back in.