Does cudaThreadSynchronize() take 600+ ms to execute?

Hello,

Does cudaThreadSynchronize() really take 600+ ms to execute?

I ask because I'm executing the two lines of code below, and they take 600+ ms:

foo<<<32,32>>>( … );
cudaThreadSynchronize();

and when I execute only the kernel function,

foo<<<32,32>>>( … );

it takes 0.1 ms.

Why?

cudaThreadSynchronize() will wait for all previous kernel invocations to finish. If there are unfinished kernels, then cudaThreadSynchronize() will block until they are complete.

Kernel launches are asynchronous, so timing the kernel without cudaThreadSynchronize() is measuring only the time to launch the kernel, not the time it takes for the kernel to finish.
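To make the difference concrete, here is a rough sketch that times the same launch both ways with a host clock. The body and arguments of foo are not shown in this thread, so the kernel below is a made-up stand-in, and time_foo_ms, d_data and iters are names invented for illustration. It also uses cudaDeviceSynchronize(), which replaced the now-deprecated cudaThreadSynchronize() in newer CUDA releases.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's foo kernel: the real body and
// arguments are elided in the thread. It just burns time per thread.
__global__ void foo(float *data, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[i];
    for (int k = 0; k < iters; ++k)
        v = v * 0.5f + 1.0f;
    data[i] = v;
}

// Returns host wall-clock milliseconds around one launch of foo<<<32,32>>>.
// d_data must point to at least 32 * 32 floats of device memory.
// wait == false: the timed region covers only the asynchronous launch call.
// wait == true:  the timed region also covers blocking until the kernel is done.
double time_foo_ms(float *d_data, int iters, bool wait)
{
    auto t0 = std::chrono::steady_clock::now();
    foo<<<32, 32>>>(d_data, iters);
    if (wait)
        cudaDeviceSynchronize();  // successor of cudaThreadSynchronize()
    auto t1 = std::chrono::steady_clock::now();

    // Outside the timed region: drain the GPU so a later measurement
    // does not also pay for this kernel.
    cudaDeviceSynchronize();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling time_foo_ms(d_data, iters, false) and then time_foo_ms(d_data, iters, true) on the same buffer should show the pattern from the question: the first number reflects only the launch, while the second includes the kernel's actual run time. A warm-up call beforehand keeps one-time CUDA initialization out of both numbers.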

Hi!

How are you measuring the time? A cudaThreadSynchronize() is needed to correctly measure the time taken by the kernel run, and not only by the kernel launch.

Oops, I’ve been too slow…
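As an alternative to a host timer plus a synchronize call, CUDA events can time the kernel on the GPU side. This is only a sketch of that different technique, not what the original poster ran: the foo kernel is again a hypothetical stand-in (the real one is not shown here), and time_foo_with_events_ms, d_data and iters are illustrative names.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's foo kernel (real body not shown).
__global__ void foo(float *data, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[i];
    for (int k = 0; k < iters; ++k)
        v = v * 0.5f + 1.0f;
    data[i] = v;
}

// Returns the elapsed milliseconds between two events recorded around one
// launch of foo<<<32,32>>>, or -1.0f if any event call fails.
// d_data must point to at least 32 * 32 floats of device memory.
float time_foo_with_events_ms(float *d_data, int iters)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess)
        return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start, 0);       // enqueued before the kernel
    foo<<<32, 32>>>(d_data, iters);  // asynchronous launch
    cudaEventRecord(stop, 0);        // enqueued after the kernel

    // Block the host until the stop event has been reached, which on the
    // same stream implies the kernel has finished.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err == cudaSuccess ? ms : -1.0f;
}
```

The wait is still there (cudaEventSynchronize), so the point made above holds: without waiting for the kernel to finish, only the launch gets measured.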