When timing a single kernel launch, cudaThreadSynchronize() is necessary because kernel launches are asynchronous with respect to the host, correct? As follows:
//start timer
myKernel<<<…>>>(…);
cudaThreadSynchronize();
//stop timer.
But when looping over a large number of kernel calls as follows, should cudaThreadSynchronize() be called on each iteration? This is running on a compute capability 1.x device, which, as I understand it, does not support concurrent kernel execution.
Is this correct…
//start timer
for (int i = 0; i < 1000; ++i)
{
myKernel<<<…>>>(…);
cudaThreadSynchronize();
}
//stop timer
or is this a more accurate representation?
//start timer
for (int i = 0; i < 1000; ++i)
{
myKernel<<<…>>>(…);
}
cudaThreadSynchronize();
//stop timer
I’ve been thinking that since kernels can’t execute concurrently on this device, each launch implicitly waits for the previous one to finish, which removes the need for cudaThreadSynchronize() except possibly after the last kernel call, which could otherwise let the timer stop before the kernel finishes (assuming no cudaThreadSynchronize() calls are made). But when timing a kernel via the second method I get significantly better numbers than with the first, sometimes a 4-5x improvement. Is the overhead of cudaThreadSynchronize() really that large?
Are the kernels in the loop dependent on each other? I am not positive, but since NVIDIA states that pre-Fermi hardware can only execute one kernel at a time, you can assume that each kernel finishes before the next one starts. The host, however, may simply be pushing all of the launches into a queue on the device. If the kernels are independent of each other and you want to time them, call cudaThreadSynchronize() only once, after the loop.
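As a side note, GPU-side timing with CUDA events sidesteps the question entirely, since the events are recorded in the same stream as the kernels and measure device time directly. A minimal sketch, where myKernel, grid, block, and the kernel arguments are placeholders for whatever you are actually launching:

```cuda
// Event-based timing of a loop of kernel launches (sketch).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);               // enqueue start marker on stream 0
for (int i = 0; i < 1000; ++i)
    myKernel<<<grid, block>>>(/* args */);
cudaEventRecord(stop, 0);                // enqueue stop marker after last kernel
cudaEventSynchronize(stop);              // block host until the stop event completes

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed device time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

This needs only one host-side wait, after the whole loop, so it measures execution time without paying a sync per iteration.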
Yes, others and I have come to a similar conclusion. It would seem the host queues up multiple kernel launches asynchronously even though the device executes them in order. My interpretation: timing without any sync calls measures only the kernel launch (enqueue) overhead; timing with a single post-loop sync measures the device-side execution time, since the launch overhead is hidden by the kernels executing on the device; and syncing on each iteration measures the launch overhead, the cost of each cudaThreadSynchronize() call, and the actual execution time combined.
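That interpretation can be checked by timing the launch loop and the final sync separately; a rough host-side sketch, where hostTimerNow() stands in for any host wall-clock function and the kernel launch details are placeholders:

```cuda
// Separating launch (enqueue) overhead from device execution time (sketch).
double t0 = hostTimerNow();
for (int i = 0; i < 1000; ++i)
    myKernel<<<grid, block>>>(/* args */);
double t1 = hostTimerNow();   // t1 - t0: time to enqueue 1000 launches

cudaThreadSynchronize();      // wait for all queued kernels to finish
double t2 = hostTimerNow();   // t2 - t0: total device-side execution time

// t2 - t1: device work still outstanding after the host finished enqueuing.
```

If t1 - t0 is small relative to t2 - t0, the loop itself is only measuring launch overhead, consistent with the reasoning above.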