I want to understand the semantics of cudaThreadSynchronize() when it is used in a multithreaded program. (I understand it is deprecated in CUDA 4, but I’m using 3.2.)
Suppose I have a multithreaded application that creates two pthreads, P1 and P2. Each of P1 and P2 launches a kernel and then calls cudaThreadSynchronize() before reading the computed results. Does a single call to cudaThreadSynchronize() block until the kernels launched by both pthreads have completed, or only until the calling thread’s own kernel has completed?
I will use the following notation:
- P1.k1: P1’s kernel invocation.
- P1.cts1: P1’s call to cudaThreadSynchronize().
- P2.k2: P2’s kernel invocation.
- P2.cts2: P2’s call to cudaThreadSynchronize().
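For concreteness, here is a minimal sketch of the setup I have in mind. The kernel `scale`, the `worker` function, and the buffer size are placeholders made up for illustration, not my real code, and error checking is omitted.

```cuda
#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

// Hypothetical kernel (illustration only): doubles each element in place.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2.0f;
    }
}

// Body shared by both pthreads; arg is the thread's label ("P1" or "P2").
static void *worker(void *arg)
{
    const int n = 1024;              // assumed buffer size
    float host[n];
    float *dev = NULL;

    for (int i = 0; i < n; ++i) {
        host[i] = 1.0f;
    }

    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(dev, n);   // P1.k1 or P2.k2
    cudaThreadSynchronize();                    // P1.cts1 or P2.cts2

    // Read back the computed results after the synchronize call.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("%s: host[0] = %f\n", (const char *)arg, host[0]);
    return NULL;
}

int main(void)
{
    pthread_t p1, p2;

    // Each pthread launches its own kernel and synchronizes on its own.
    pthread_create(&p1, NULL, worker, (void *)"P1");
    pthread_create(&p2, NULL, worker, (void *)"P2");

    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    return 0;
}
```

The two threads are not ordered relative to each other here, so the kernel launches and synchronize calls from P1 and P2 can interleave in any order.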
Suppose execution occurs in the following order:
- P1.k1, P2.k2
Now, does (2) ensure that P1.k1 completes before proceeding?