Kernel Timing and cudaThreadSynchronize()

When timing a single kernel launch, cudaThreadSynchronize() is necessary because kernel launches are asynchronous, correct? As follows:

//start timer

myKernel<<<…>>>(…);

cudaThreadSynchronize();

//stop timer
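For concreteness, here is a minimal, compilable version of that pattern. dummyKernel, the 64-block/512-thread launch shape, and the gettimeofday()-based timer are placeholders standing in for the elided details, not the original code; note also that cudaThreadSynchronize() was later deprecated in favor of cudaDeviceSynchronize().

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data)     // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 64 * 512 * sizeof(float));

    timeval start, stop;
    gettimeofday(&start, 0);                 // start timer

    dummyKernel<<<64, 512>>>(d_data);        // launch is asynchronous
    cudaThreadSynchronize();                 // block host until the kernel finishes
                                             // (deprecated; cudaDeviceSynchronize() on newer toolkits)

    gettimeofday(&stop, 0);                  // stop timer
    double ms = (stop.tv_sec - start.tv_sec) * 1000.0
              + (stop.tv_usec - start.tv_usec) / 1000.0;
    printf("kernel time: %.3f ms\n", ms);

    cudaFree(d_data);
    return 0;
}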

But when timing a loop over a large number of kernel calls, as below, should cudaThreadSynchronize() be called at each iteration? This is running on a compute capability 1.x device which, as I understand it, does not support concurrent kernel execution.

Is this correct…

//start timer

for (int i = 0; i < 1000; ++i)
{
myKernel<<<…>>>(…);
cudaThreadSynchronize();
}

//stop timer

or is this a more accurate representation?

//start timer

for (int i = 0; i < 1000; ++i)
{
myKernel<<<…>>>(…);
}

cudaThreadSynchronize();

//stop timer

I’ve been thinking that since kernels can’t execute concurrently, each new launch will implicitly wait for the previous kernel to finish, which would remove the need for cudaThreadSynchronize() inside the loop. The exception would be the last kernel call, which could let the timer stop before the kernel actually finishes if no cudaThreadSynchronize() call is made. But when I time the kernel via the second method, I get significantly lower times than with the first, sometimes a 4-5x difference. Is the overhead of cudaThreadSynchronize() really that large?
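As a sketch of how one might measure both variants in the same program (reusing the same placeholder dummyKernel, launch shape, and gettimeofday() timer as above; these details are assumptions, not from the original post):

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data)      // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

static double elapsedMs(const timeval &a, const timeval &b)
{
    return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_usec - a.tv_usec) / 1000.0;
}

// Times 1000 launches, syncing either every iteration (method 1)
// or once after the loop (method 2).
static double timeLoop(float *d_data, bool syncEachIteration)
{
    timeval start, stop;
    gettimeofday(&start, 0);
    for (int i = 0; i < 1000; ++i)
    {
        dummyKernel<<<64, 512>>>(d_data);
        if (syncEachIteration)
            cudaThreadSynchronize();          // method 1: sync every iteration
    }
    if (!syncEachIteration)
        cudaThreadSynchronize();              // method 2: one sync after the loop
    gettimeofday(&stop, 0);
    return elapsedMs(start, stop);
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 64 * 512 * sizeof(float));
    printf("sync each iteration:  %.3f ms\n", timeLoop(d_data, true));
    printf("sync once after loop: %.3f ms\n", timeLoop(d_data, false));
    cudaFree(d_data);
    return 0;
}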

Thanks!
-Justin

What OS are you timing on?

I’m timing on Linux (Ubuntu 9.04, I believe).

Thanks!
-Justin

And how long are the runtimes of individual kernel launches?

I’ve timed my kernel as follows:

single run, no cudaThreadSynchronize() call: 0.044 ms

single run, with cudaThreadSynchronize() call: 0.076 ms

1000 runs, cudaThreadSynchronize() each iteration: 34.6 ms total

1000 runs, no cudaThreadSynchronize() calls: 7.7 ms total

1000 runs, one cudaThreadSynchronize() call after the loop: 20.7 ms total

I am launching 64 blocks of 512 threads each on a Tesla C1060.

Also, I’m using a system timer, not CUDA events.
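For comparison, CUDA events record timestamps on the GPU itself, so the result doesn’t depend on where the host-side sync lands. A rough sketch of timing the 1000-launch loop with events (dummyKernel and the launch shape are placeholders again):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data)     // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 64 * 512 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);               // queued in the same stream as the launches
    for (int i = 0; i < 1000; ++i)
        dummyKernel<<<64, 512>>>(d_data);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              // wait for the stop event (and all prior work)

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // device-side elapsed time between the two events
    printf("1000 launches: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}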

Thanks,

-Justin

Are the kernels in the loop dependent on each other? I’m not positive, but since NVIDIA states that only one kernel can execute at a time on pre-Fermi hardware, you can assume that each kernel finishes before the next one starts. However, the host may just be sending all of the kernel launches to a queue on the device. If the kernels are independent of each other and you want to time them, call cudaThreadSynchronize() only once, after the loop.

Yes, I and others have come to a similar conclusion. It seems the host queues up multiple kernel launches asynchronously even though the device executes them in order. My read of the numbers: timing without any sync calls measures only the kernel launch overhead; timing with a single post-loop sync measures the device-side execution time, since the launch overhead is hidden behind kernels already executing on the device; and syncing at each iteration measures the launch overhead plus the cost of each cudaThreadSynchronize() call plus the actual execution time.
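A quick way to see both effects in one run, under that reading (placeholder kernel and timer again; nothing here is from the original posts): take one host timestamp right after the launch loop, before syncing, and a second one after the sync. The first interval is roughly the queueing/launch overhead; the second is launch overhead plus device execution.

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data)     // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

static double msSince(const timeval &t0)     // milliseconds elapsed since t0
{
    timeval t1;
    gettimeofday(&t1, 0);
    return (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_usec - t0.tv_usec) / 1000.0;
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 64 * 512 * sizeof(float));

    timeval t0;
    gettimeofday(&t0, 0);
    for (int i = 0; i < 1000; ++i)
        dummyKernel<<<64, 512>>>(d_data);    // launches return as soon as they are queued
    double launchMs = msSince(t0);           // ~ launch/queueing overhead only

    cudaThreadSynchronize();                 // drain the queue
    double totalMs = msSince(t0);            // launch overhead + device execution

    printf("queueing: %.3f ms, total: %.3f ms\n", launchMs, totalMs);
    cudaFree(d_data);
    return 0;
}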