No need to check cudaThreadSynchronize() in release mode?


After every global function (kernel) call, we should call the CUT_CHECK_ERROR(errorMessage) macro.
This macro calls two functions: cudaGetLastError() and cudaThreadSynchronize().

It is defined in cutil.h like this:

#ifdef _DEBUG

#define CUT_CHECK_ERROR(errorMessage) do { \
    cudaError_t err = cudaGetLastError(); \
    if( cudaSuccess != err) { \
        fprintf(stderr, "Cuda error: %s in file '%s' in line %i : %s.\n", \
                errorMessage, __FILE__, __LINE__, cudaGetErrorString( err) ); \
        exit(EXIT_FAILURE); \
    } \
    err = cudaThreadSynchronize(); \
    if( cudaSuccess != err) { \
        fprintf(stderr, "Cuda error: %s in file '%s' in line %i : %s.\n", \
                errorMessage, __FILE__, __LINE__, cudaGetErrorString( err) ); \
        exit(EXIT_FAILURE); \
    } } while (0)

#else

#define CUT_CHECK_ERROR(errorMessage)

#endif


This means that in release mode this macro does nothing.

So is there no need to synchronize the threads in release mode? Or will the GPU do it automatically?

No, synchronization is necessary when using asynchronous operations using streams. See programming guide section

We can be a bit more general: No, no cudaThreadSynchronize() call is EVER needed to ensure the correctness of your program (as long as you aren’t using the zero-copy feature of CUDA 2.2, or Async memcpy operations, I forgot about that one). Any cudaMemcpy/etc… call will either be queued and executed in order or implicitly synchronize with the GPU.

The only reasons to ever call cudaThreadSynchronize() are

  1. Make sure the GPU has finished executing so you can measure a wall-clock time for benchmarking reasons
  2. Check error codes after kernel calls
  3. In CUDA 2.2: ensure that a kernel has finished so you can read values written to host mapped memory, though events are probably more efficient for that
  4. When using cudaMemcpy*Async methods to copy to the host
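To illustrate item 4, here is a minimal sketch of an asynchronous device-to-host copy that has to be waited on before the host reads the buffer. The kernel name write_index and the function async_copy_demo are made up for this example, and it uses cudaDeviceSynchronize(), the current name for the now-deprecated cudaThreadSynchronize(). It assumes pinned host memory from cudaMallocHost() and the default stream.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its own global index.
__global__ void write_index(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

// Returns 1 if the host buffer holds the expected values after the wait,
// 0 if not, and -1 if an allocation fails.
int async_copy_demo(void)
{
    const int n = 32 * 32;
    int *d_buf = NULL, *h_buf = NULL;
    if (cudaMalloc(&d_buf, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMallocHost(&h_buf, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_buf);
        return -1;
    }

    write_index<<<32, 32>>>(d_buf);

    // Queued behind the kernel in stream 0; this call may return to the
    // host before the copy has actually completed.
    cudaMemcpyAsync(h_buf, d_buf, n * sizeof(int), cudaMemcpyDeviceToHost, 0);

    // Without this wait, reading h_buf below would race with the copy.
    cudaError_t err = cudaDeviceSynchronize();

    int ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i)
        ok = (h_buf[i] == i);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return ok ? 1 : 0;
}
```

With a plain blocking cudaMemcpy() in place of cudaMemcpyAsync(), the explicit wait would not be needed.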

And I would highly recommend not using CUT_CHECK_ERROR. There are many times when you might want to check for error conditions even in a release build, based on an option flag.
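One way to get error checking that is controlled by an option flag rather than by _DEBUG might look like the sketch below. The names g_check_errors, check_last_kernel, CHECK_KERNEL, fill and launch_and_check are all hypothetical, not part of cutil or the CUDA runtime. It uses cudaDeviceSynchronize(), the current name for the now-deprecated cudaThreadSynchronize(), and it returns the error code instead of exiting so the caller can decide what to do.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical runtime switch; could be set from a command-line option.
static bool g_check_errors = true;

// Returns the first error found, or cudaSuccess when checking is disabled.
static cudaError_t check_last_kernel(const char *msg, const char *file, int line)
{
    if (!g_check_errors) return cudaSuccess;

    cudaError_t err = cudaGetLastError();      // launch/configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();         // errors raised while the kernel ran
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s in file '%s' in line %d: %s.\n",
                msg, file, line, cudaGetErrorString(err));
    return err;
}

#define CHECK_KERNEL(msg) check_last_kernel((msg), __FILE__, __LINE__)

// Hypothetical kernel used only to have something to check.
__global__ void fill(int *out, int value)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = value;
}

// Launches the kernel and checks it; returns the resulting error code.
cudaError_t launch_and_check(void)
{
    int *d_out = NULL;
    cudaError_t err = cudaMalloc(&d_out, 32 * 32 * sizeof(int));
    if (err != cudaSuccess) return err;

    fill<<<32, 32>>>(d_out, 7);
    err = CHECK_KERNEL("fill failed");

    cudaFree(d_out);
    return err;
}
```

Because the check synchronizes, leaving the flag on costs some overlap between CPU and GPU work, which is presumably why it is worth making optional.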

I guess that sounds really bad, huh. I say that cudaThreadSynchronize() is never needed and then give a huge list of exceptions… oh well. I was just trying to get the point across that in the vast majority of CUDA applications, cudaThreadSynchronize() is not needed to guarantee correctness. The API/driver will automatically sync for you in almost all cases where it is needed.


I’m a newbie, so let me ask a question:

What if you call a kernel1 which calculates some outputs that are needed as inputs to kernel2. Am I just wasting my time, calling cudaThreadSynchronize() before calling kernel2?


Yes. Because only one kernel can run at a time, there’s already an implicit synchronization barrier between kernel1 and kernel2.
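A minimal sketch of that situation, assuming both kernels are launched into the default stream (where launches execute in issue order). The function name chained_kernels_demo and the kernel bodies are invented for illustration.

```cuda
#include <cuda_runtime.h>

// Produces the intermediate values.
__global__ void kernel1(int *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;
}

// Consumes kernel1's output.
__global__ void kernel2(const int *in, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2;
}

// Returns 1 if kernel2 saw kernel1's results, 0 if not,
// and -1 if an allocation fails.
int chained_kernels_demo(void)
{
    const int n = 32 * 32;
    static int h_out[32 * 32];
    int *d_tmp = NULL, *d_out = NULL;
    if (cudaMalloc(&d_tmp, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_tmp);
        return -1;
    }

    kernel1<<<32, 32>>>(d_tmp);
    // No explicit synchronization here: kernel2 is queued behind kernel1.
    kernel2<<<32, 32>>>(d_tmp, d_out);

    // Blocking copy; it waits for both kernels before copying.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);

    int ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i)
        ok = (h_out[i] == 2 * i);

    cudaFree(d_out);
    cudaFree(d_tmp);
    return ok ? 1 : 0;
}
```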

Does that also mean you need to call __syncthreads() inside your kernel only when you are using shared memory? Otherwise CUDA sets up the sync barrier for kernel execution? Is that correct?
(Newbie asking questions)

__syncthreads() and cudaThreadSynchronize() are completely different things.

You never need to call __syncthreads() except to prevent race conditions when writing/reading shared memory.
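For contrast, a small sketch of where __syncthreads() is needed: threads in one block exchange values through shared memory, so every write has to land before any thread reads a slot written by a different thread. The names reverse_block and reverse_demo, and the fixed block size of 64, are assumptions made for this example.

```cuda
#include <cuda_runtime.h>

// Reverses 64 ints within a single block using shared memory.
__global__ void reverse_block(const int *in, int *out)
{
    __shared__ int tile[64];
    int t = threadIdx.x;

    tile[t] = in[t];
    // Barrier for the threads of this block only: all writes to tile
    // must finish before any thread reads another thread's slot.
    __syncthreads();
    out[t] = tile[blockDim.x - 1 - t];
}

// Returns 1 if the output is the reversed input, 0 if not,
// and -1 if an allocation fails.
int reverse_demo(void)
{
    const int n = 64;
    int h_in[64], h_out[64];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = NULL, *d_out = NULL;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }

    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    reverse_block<<<1, n>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);

    int ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i)
        ok = (h_out[i] == n - 1 - i);

    cudaFree(d_out);
    cudaFree(d_in);
    return ok ? 1 : 0;
}
```

Note that the host side never calls a device-wide synchronize here; the blocking cudaMemcpy() is enough.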

Will cudaThreadSynchronize() take 600+ ms to execute?

Because I’m executing the 2 lines of code below and it takes 600+ ms:

foo<<<32,32>>>( … );

cudaThreadSynchronize();

and when I execute only the kernel function,

foo<<<32,32>>>( … );

it takes 0.1 ms.


The 600+ms is the time your kernel runs.

The kernel call is asynchronous, so if you don’t put a cudaThreadSynchronize() (or another kernel call, or a memcpy, as suggested by tmurray and MisterAnderson) after it, you immediately return to the CPU host code. If you measure that, you’ll get 0.1 ms.

On the other hand, if you put a cudaThreadSynchronize() after the kernel launch (or another kernel call, or a cudaMemcpy, which will implicitly synchronize) and ONLY then measure the time, you’ll see the actual time your kernel ran.
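The difference between the two measurements can be sketched roughly as follows. The kernel busy, the Timing struct and the function time_kernel are hypothetical names for this example. Host-side timing uses std::chrono, and cudaDeviceSynchronize() stands in for the now-deprecated cudaThreadSynchronize().

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical kernel that just burns time in a loop.
__global__ void busy(float *out, int iters)
{
    float x = threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * 1.0000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

struct Timing {
    double launch_ms;  // time until the launch call returned to the host
    double total_ms;   // time until the kernel had actually finished
};

// Returns both timings in milliseconds, or {-1, -1} on an error.
Timing time_kernel(int iters)
{
    Timing t = { -1.0, -1.0 };
    float *d_out = NULL;
    if (cudaMalloc(&d_out, 32 * 32 * sizeof(float)) != cudaSuccess) return t;

    // Warm-up launch so one-time startup cost is not measured.
    busy<<<32, 32>>>(d_out, 1);
    cudaDeviceSynchronize();

    typedef std::chrono::steady_clock clock;
    clock::time_point t0 = clock::now();
    busy<<<32, 32>>>(d_out, iters);             // returns almost immediately
    clock::time_point t1 = clock::now();
    cudaError_t err = cudaDeviceSynchronize();  // waits for the kernel
    clock::time_point t2 = clock::now();

    cudaFree(d_out);
    if (err != cudaSuccess) return t;

    t.launch_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    t.total_ms  = std::chrono::duration<double, std::milli>(t2 - t0).count();
    return t;
}
```

Here launch_ms corresponds to the 0.1 ms style of measurement and total_ms to the 600+ ms one; the actual numbers depend on the kernel and the GPU.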