We can be a bit more general: no, a cudaThreadSynchronize() call is EVER needed to ensure the correctness of your program is false — as long as you aren’t using the zero-copy feature of CUDA 2.2 or the async memcpy operations (I forgot about that one), you never need it. Any cudaMemcpy/etc… call will either be queued and executed in order or implicitly synchronize with the GPU.
The only reasons to ever call cudaThreadSynchronize() are:
- making sure the GPU has finished executing so you can measure wall-clock time for benchmarking
- checking error codes after kernel calls
- in CUDA 2.2: ensuring that a kernel has finished so you can read values it wrote to host-mapped memory (though events are probably more efficient for that)
- making sure a cudaMemcpy*Async copy to the host has completed before you read the destination buffer
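The first two cases above can be sketched like this (the kernel name and launch configuration are made up for illustration):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... */ }

void launchAndCheck(float *d_data)
{
    myKernel<<<64, 256>>>(d_data);

    // 1) Benchmarking: the launch returns immediately, so block here
    //    before stopping a host-side wall-clock timer.
    cudaThreadSynchronize();

    // 2) Error checking: an error inside the kernel only surfaces after
    //    the kernel has actually run, so query it after the sync.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
}
```

(Later CUDA releases deprecate cudaThreadSynchronize() in favor of cudaDeviceSynchronize(), which behaves the same way here.)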
And I would highly recommend not using CUT_CHECK_ERROR: it compiles away to nothing in release builds, but there are many times when you want to check for error conditions even in a release build, e.g. based on an option flag.
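A hand-rolled alternative might look like this — a sketch only, with an assumed global flag (`g_checkErrors`) set from a command-line option, so the check works in release builds too:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

extern int g_checkErrors;  // hypothetical flag, set from an option

#define CHECK_CUDA(msg)                                         \
    do {                                                        \
        if (g_checkErrors) {                                    \
            cudaThreadSynchronize();                            \
            cudaError_t e = cudaGetLastError();                 \
            if (e != cudaSuccess)                               \
                fprintf(stderr, "%s: %s\n", (msg),              \
                        cudaGetErrorString(e));                 \
        }                                                       \
    } while (0)
```

Unlike CUT_CHECK_ERROR, this costs nothing at runtime when the flag is off, and you don’t need a separate debug build to diagnose a failing kernel in the field.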
I guess that sounds really bad, huh. I say that cudaThreadSynchronize() is never needed and then give a huge list of exceptions… oh well. I was just trying to get the point across that in the vast majority of CUDA applications, cudaThreadSynchronize() is not needed to guarantee correctness. The API/driver will automatically sync for you in almost all cases where it is needed.
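To illustrate the automatic sync: a blocking cudaMemcpy queued after a kernel launch will not start until that kernel has finished, so no explicit cudaThreadSynchronize() is needed for correctness (kernel name and sizes are made up):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *out) { /* ... */ }

void runAndReadBack(float *h_out, float *d_out, size_t bytes)
{
    myKernel<<<64, 256>>>(d_out);          // asynchronous launch

    // Queued behind the kernel; blocks the host until both the kernel
    // and the copy are done.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // h_out is now safe to read — no explicit sync was required.
}
```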
Does that also mean you implicitly call __syncthreads() inside your kernel only when you are using shared memory? Otherwise CUDA sets up the sync barrier for kernel execution? Is that correct?
(Newbie asking questions)