cudaThreadSynchronize() after kernel call?

When you time a kernel from the host, do you need to call cudaThreadSynchronize() before you stop the timer to get an accurate measurement? I already call __syncthreads() at the end of the kernel to make sure all threads are finished.

What is the difference between these two calls? The measured time differs significantly depending on whether I use the outer sync or not.

__syncthreads() works on the GPU, inside the kernel.
cudaThreadSynchronize() blocks the host until all kernels have finished. For timing purposes you need cudaThreadSynchronize().
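
A minimal sketch of the difference, assuming a made-up kernel (busyKernel) and helper (timeKernelMs) that are not from this thread. It uses cudaDeviceSynchronize(), which replaced the now-deprecated cudaThreadSynchronize() in later CUDA releases; both block the host in the same way for this purpose.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel whose only job is to keep the GPU busy for a while.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 10000; ++k)
            x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

// Launches the kernel and returns the host-measured time in milliseconds.
// With waitForGpu == false the timer stops as soon as the launch returns.
double timeKernelMs(float *d_data, int n, bool waitForGpu)
{
    auto start = std::chrono::steady_clock::now();
    busyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    if (waitForGpu)
        cudaDeviceSynchronize();  // block the host until the kernel is done
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Warm-up launch so one-time initialization is not part of the timing.
    timeKernelMs(d_data, n, true);

    double launchOnly = timeKernelMs(d_data, n, false);
    cudaDeviceSynchronize();  // drain the unsynchronized launch before timing again
    double withSync = timeKernelMs(d_data, n, true);

    printf("without sync: %.3f ms\n", launchOnly);
    printf("with sync:    %.3f ms\n", withSync);

    cudaFree(d_data);
    return 0;
}
```

The first number should mostly reflect launch overhead, the second the launch plus the kernel's execution; the exact values depend on the GPU and driver.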

Thanks!

__syncthreads() only acts as a barrier for threads within a block. You do not need it to “make sure that all threads are finished”. __syncthreads() is only needed to prevent race conditions when multiple threads access the same region of shared memory.
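
To illustrate the shared-memory case, here is a small sketch in which each block reverses its own tile of an array. The kernel name reverseTile, the helper runReverse, and the tile size are assumptions made for this example. Without the barrier, a thread could read a slot of the tile before the thread responsible for it has written it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256

// Each block reverses its own BLOCK_SIZE-element tile of the input.
__global__ void reverseTile(const int *in, int *out)
{
    __shared__ int tile[BLOCK_SIZE];
    int base = blockIdx.x * BLOCK_SIZE;
    int t = threadIdx.x;

    tile[t] = in[base + t];
    __syncthreads();  // all threads of THIS block have written their slot
    out[base + t] = tile[BLOCK_SIZE - 1 - t];
}

// Hypothetical host helper: runs the kernel and checks the result.
bool runReverse(int numBlocks)
{
    const int n = numBlocks * BLOCK_SIZE;
    std::vector<int> h_in(n), h_out(n, -1);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    reverseTile<<<numBlocks, BLOCK_SIZE>>>(d_in, d_out);

    // This blocking copy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int b = 0; b < numBlocks; ++b)
        for (int t = 0; t < BLOCK_SIZE; ++t)
            if (h_out[b * BLOCK_SIZE + t] != b * BLOCK_SIZE + (BLOCK_SIZE - 1 - t))
                return false;
    return true;
}

int main()
{
    printf("%s\n", runReverse(4) ? "reverse OK" : "reverse FAILED");
    return 0;
}
```

Note that the barrier says nothing about other blocks, and nothing about when the kernel as a whole is finished from the host's point of view.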

As I understand it, if you use a GPU timer you need cudaThreadSynchronize(), but with a CPU timer it is not needed, right?

No, wrong. Kernel launches are non-blocking on the host. If you don’t call cudaThreadSynchronize() before stopping a timer after a kernel launch, the timer will only measure the kernel launch time, not the time it took the kernel to finish executing.
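
For the GPU timer mentioned above, the usual tool is the CUDA event API. A hedged sketch, assuming a placeholder kernel (scaleKernel) and helper (timeWithEvents) invented for this example: the events are recorded in the stream around the launch, and the host still has to wait, here with cudaEventSynchronize() on the stop event, before the elapsed time can be read.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element by a factor.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Returns the GPU-measured time of one launch in milliseconds, or -1 on error.
float timeWithEvents(float *d_data, int n)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start);  // enqueued in the default stream
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaEventRecord(stop);   // enqueued after the kernel

    // The calls above return immediately; wait until the stop event
    // has actually been reached on the GPU.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err == cudaSuccess ? ms : -1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    timeWithEvents(d_data, n);  // warm-up launch
    printf("kernel time: %.3f ms\n", timeWithEvents(d_data, n));

    cudaFree(d_data);
    return 0;
}
```

Reading the elapsed time without waiting for the stop event would fail or give a meaningless result, for the same reason a host timer stopped right after the launch does.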