Kernel Timing and cudaThreadSynchronize()

When timing a single kernel launch, cudaThreadSynchronize() is necessary because kernel launches are asynchronous, correct? As follows:

//start timer

myKernel<<<…>>>(…);

cudaThreadSynchronize();

//stop timer
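For concreteness, here is a minimal, compilable version of that pattern. dummyKernel, the 64-block/512-thread launch shape, and the gettimeofday()-based timer are placeholders standing in for the elided details, not the original code; note also that cudaThreadSynchronize() was later deprecated in favor of cudaDeviceSynchronize().

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data)     // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 64 * 512 * sizeof(float));

    timeval start, stop;
    gettimeofday(&start, 0);                 // start timer

    dummyKernel<<<64, 512>>>(d_data);        // launch is asynchronous
    cudaThreadSynchronize();                 // block host until the kernel finishes
                                             // (deprecated; cudaDeviceSynchronize() on newer toolkits)

    gettimeofday(&stop, 0);                  // stop timer
    double ms = (stop.tv_sec - start.tv_sec) * 1000.0
              + (stop.tv_usec - start.tv_usec) / 1000.0;
    printf("kernel time: %.3f ms\n", ms);

    cudaFree(d_data);
    return 0;
}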

But when timing a loop over a large number of kernel calls, as below, should cudaThreadSynchronize() be called at each iteration? This is running on a compute capability 1.x device which, as I understand it, does not support concurrent kernel execution.

Is this correct…

//start timer

for (int i = 0; i < 1000; ++i)
{
myKernel<<<…>>>(…);
cudaThreadSynchronize();
}

//stop timer

or is this a more accurate representation?

//start timer

for (int i = 0; i < 1000; ++i)
{
myKernel<<<…>>>(…);
}

cudaThreadSynchronize();

//stop timer

I’ve been thinking that since kernels can’t execute concurrently, each new launch will implicitly wait for the previous kernel to finish, which would remove the need for cudaThreadSynchronize() inside the loop. The exception would be the last kernel call, which could let the timer stop before the kernel actually finishes if no cudaThreadSynchronize() call is made. But when I time the kernel via the second method, I get significantly lower times than with the first, sometimes a 4-5x difference. Is the overhead of cudaThreadSynchronize() really that large?
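As a sketch of how one might measure both variants in the same program (reusing the same placeholder dummyKernel, launch shape, and gettimeofday() timer as above; these details are assumptions, not from the original post):

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data)      // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

static double elapsedMs(const timeval &a, const timeval &b)
{
    return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_usec - a.tv_usec) / 1000.0;
}

// Times 1000 launches, syncing either every iteration (method 1)
// or once after the loop (method 2).
static double timeLoop(float *d_data, bool syncEachIteration)
{
    timeval start, stop;
    gettimeofday(&start, 0);
    for (int i = 0; i < 1000; ++i)
    {
        dummyKernel<<<64, 512>>>(d_data);
        if (syncEachIteration)
            cudaThreadSynchronize();          // method 1: sync every iteration
    }
    if (!syncEachIteration)
        cudaThreadSynchronize();              // method 2: one sync after the loop
    gettimeofday(&stop, 0);
    return elapsedMs(start, stop);
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 64 * 512 * sizeof(float));
    printf("sync each iteration:  %.3f ms\n", timeLoop(d_data, true));
    printf("sync once after loop: %.3f ms\n", timeLoop(d_data, false));
    cudaFree(d_data);
    return 0;
}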

Thanks!
-Justin

What OS are you timing on?

I’m timing on Linux (Ubuntu 9.04, I believe).

Thanks!
-Justin

And how long are the runtimes of individual kernel launches?

I’ve timed my kernel as follows:

single run, no cudaThreadSynchronize() call: 0.044 ms

single run, with cudaThreadSynchronize() call: 0.076 ms

1000 runs, cudaThreadSynchronize() each iteration: 34.6 ms total

1000 runs, no cudaThreadSynchronize() calls: 7.7 ms total

1000 runs, one cudaThreadSynchronize() call after the loop: 20.7 ms total

I am launching 64 blocks of 512 threads each on a Tesla C1060.

Also, I’m using a system timer, not CUDA events.
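For comparison, CUDA events record timestamps on the GPU itself, so the result doesn’t depend on where the host-side sync lands. A rough sketch of timing the 1000-launch loop with events (dummyKernel and the launch shape are placeholders again):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data)     // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 64 * 512 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);               // queued in the same stream as the launches
    for (int i = 0; i < 1000; ++i)
        dummyKernel<<<64, 512>>>(d_data);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              // wait for the stop event (and all prior work)

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // device-side elapsed time between the two events
    printf("1000 launches: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}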

Thanks,

-Justin

Are the kernels in the loop dependent on each other? I’m not positive, but since NVIDIA states that only one kernel can execute at a time on pre-Fermi hardware, you can assume that each kernel finishes before the next one starts. However, the host may just be sending all of the kernel launches to a queue on the device. If the kernels are independent of each other and you want to time them, call cudaThreadSynchronize() only once, after the loop.

Yes, I and others have come to a similar conclusion. It seems the host queues up multiple kernel launches asynchronously even though the device executes them in order. My read of the numbers: timing without any sync calls measures only the kernel launch overhead; timing with a single post-loop sync measures the device-side execution time, since the launch overhead is hidden behind kernels already executing on the device; and syncing at each iteration measures the launch overhead plus the cost of each cudaThreadSynchronize() call plus the actual execution time.
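A quick way to see both effects in one run, under that reading (placeholder kernel and timer again; nothing here is from the original posts): take one host timestamp right after the launch loop, before syncing, and a second one after the sync. The first interval is roughly the queueing/launch overhead; the second is launch overhead plus device execution.

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data)     // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

static double msSince(const timeval &t0)     // milliseconds elapsed since t0
{
    timeval t1;
    gettimeofday(&t1, 0);
    return (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_usec - t0.tv_usec) / 1000.0;
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 64 * 512 * sizeof(float));

    timeval t0;
    gettimeofday(&t0, 0);
    for (int i = 0; i < 1000; ++i)
        dummyKernel<<<64, 512>>>(d_data);    // launches return as soon as they are queued
    double launchMs = msSince(t0);           // ~ launch/queueing overhead only

    cudaThreadSynchronize();                 // drain the queue
    double totalMs = msSince(t0);            // launch overhead + device execution

    printf("queueing: %.3f ms, total: %.3f ms\n", launchMs, totalMs);
    cudaFree(d_data);
    return 0;
}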