Behaviour of Multithreaded programs with cudaThreadSynchronize()


I wish to understand the semantics of cudaThreadSynchronize() when used in a multithreaded program. (I understand this is deprecated in CUDA 4, but I’m using 3.2).

Suppose I have a multithreaded application that creates two pthreads, P1 and P2, and each thread launches a kernel. Suppose each thread also calls cudaThreadSynchronize() before reading the computed results. Does a single call to cudaThreadSynchronize() block until both pthreads’ kernels have finished?


P1’s kernel invocation is denoted by P1.k1.
P1’s call to cudaThreadSynchronize() is denoted by P1.cts1.
P2’s kernel invocation is denoted by P2.k2.
P2’s call to cudaThreadSynchronize() is denoted by P2.cts2.

Suppose execution occurs in the following order:

  1. P1.k1, P2.k2
  2. P2.cts2

Now, does (2) guarantee that P1.k1 has also completed before P2 proceeds?


In CUDA 3.2, each host thread has its own CUDA context and thus its own separate stream of kernel launches. In your example, the cudaThreadSynchronize() call in P2 only ensures that kernel P2.k2 has completed; it does not wait for P1.k1.
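To make the scenario concrete, here is a minimal sketch of two pthreads that each launch a kernel and then call cudaThreadSynchronize() before reading back their results. The kernel `fill`, the struct `Job`, and the functions `worker` and `run_two_host_threads` are hypothetical names chosen for illustration; they are not from the original post. Under the CUDA 3.2 behaviour described above, each synchronize call should wait only for the kernel launched by the same host thread.

```cuda
#include <pthread.h>
#include <cuda_runtime.h>

// Hypothetical kernel: writes `value` into every element of `out`.
__global__ void fill(int *out, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Hypothetical per-thread job: the value to write and a success flag.
struct Job {
    int value;
    int ok;
};

// Body of each pthread: launch a kernel, synchronize, read the results back.
static void *worker(void *arg) {
    Job *job = static_cast<Job *>(arg);
    const int n = 1024;
    int h_out[n];
    int *d_out = NULL;
    job->ok = 0;

    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return NULL;

    fill<<<(n + 255) / 256, 256>>>(d_out, n, job->value);  // P1.k1 or P2.k2

    // P1.cts1 or P2.cts2: wait for this thread's kernel before reading results.
    // (Deprecated from CUDA 4.0 in favour of cudaDeviceSynchronize().)
    cudaError_t err = cudaThreadSynchronize();

    if (err == cudaSuccess &&
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) == cudaSuccess) {
        job->ok = 1;
        for (int i = 0; i < n; ++i)
            if (h_out[i] != job->value) job->ok = 0;
    }
    cudaFree(d_out);
    return NULL;
}

// Runs the two-thread scenario; returns 0 if both threads read back correct results.
int run_two_host_threads() {
    Job jobs[2] = {{1, 0}, {2, 0}};
    pthread_t threads[2];
    int started = 0;
    for (int i = 0; i < 2; ++i) {
        if (pthread_create(&threads[i], NULL, worker, &jobs[i]) != 0) break;
        ++started;
    }
    for (int i = 0; i < started; ++i) pthread_join(threads[i], NULL);
    return (started == 2 && jobs[0].ok && jobs[1].ok) ? 0 : 1;
}
```

Calling `run_two_host_threads()` from `main` and linking against pthreads should be enough to try it. Each thread checks only its own buffer, so the sketch never relies on one thread's synchronize call covering the other thread's kernel.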

Note that it is generally better to record events and use cudaStreamWaitEvent() or other fine-grained synchronization mechanisms than the big-hammer cudaThreadSynchronize() or cudaDeviceSynchronize().
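As a rough illustration of the event-based approach, the sketch below orders two kernels in different streams with an event instead of synchronizing everything. It assumes all work is issued from a single host thread, and the kernels `produce` and `consume` and the function `run_event_ordered_streams` are hypothetical names used only for this example.

```cuda
#include <cuda_runtime.h>

// Hypothetical producer kernel: buf[i] = i.
__global__ void produce(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i;
}

// Hypothetical consumer kernel: out[i] = 2 * in[i].
__global__ void consume(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2 * in[i];
}

// Returns 0 if the consumer saw the producer's completed output.
int run_event_ordered_streams() {
    const int n = 1024;
    const int blocks = (n + 255) / 256;
    int h_out[n];
    int *d_in = NULL, *d_out = NULL;
    cudaStream_t s1, s2;
    cudaEvent_t produced;

    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return 1;
    }
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreate(&produced);

    produce<<<blocks, 256, 0, s1>>>(d_in, n);
    cudaEventRecord(produced, s1);         // marks the point after produce in s1
    cudaStreamWaitEvent(s2, produced, 0);  // s2 waits on the device; the host is not blocked
    consume<<<blocks, 256, 0, s2>>>(d_in, d_out, n);

    // Block the host only on the stream whose results are needed.
    cudaError_t err = cudaStreamSynchronize(s2);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    int status = (err == cudaSuccess) ? 0 : 1;
    for (int i = 0; status == 0 && i < n; ++i)
        if (h_out[i] != 2 * i) status = 1;

    cudaEventDestroy(produced);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_in);
    cudaFree(d_out);
    return status;
}
```

The point of the design is that only the dependency that matters is enforced: `consume` waits for `produce`, and the host waits for `s2`, while any unrelated work in other streams is free to keep running.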