Custom __syncthreads() with error detection?

I’ve noticed that when the __syncthreads() rules are broken (the rule being that, within a block, either all threads or no threads must reach each __syncthreads() call), program behavior becomes undefined. This can manifest as threads “breaking through” __syncthreads() barriers; that is, some threads execute code after a barrier before other threads have reached that barrier.

To try to detect when __syncthreads() errors might exist in a program, I’ve written a simple macro and helper function that should act like a tripwire, tripping when program behavior has become undefined in the way I just mentioned. I’d appreciate any feedback on it: is it correct? Is it useful? I’ve tested it in a few applications and it seems to do what I want, but I could use a second opinion.

Here it is:

#include <assert.h>

#define USE_MY_SYNCTHREADS 1 // toggle between error-detecting and builtin __syncthreads()

__device__ int mySyncthreads()
{
    __shared__ int syncCount;
    syncCount = 0;

    __syncthreads();

    atomicAdd(&syncCount, 1);

    __syncthreads();

    return syncCount;
}

#if USE_MY_SYNCTHREADS
#define __syncthreads() assert(mySyncthreads() == blockDim.x)
#endif

The mySyncthreads() function creates a shared int, initializes it to zero, and allows each thread to atomically increment that int before returning its value. The macro redefines all __syncthreads() calls in the program to call mySyncthreads() and fail if the value returned does not equal the number of threads per block. Because macro substitutions are non-recursive, the __syncthreads() calls inside the mySyncthreads() function use the builtin CUDA __syncthreads() function rather than the macro-ized one (if this were not the case, the compiler would probably get stuck in an infinite loop). The USE_MY_SYNCTHREADS definition allows toggling between the custom and builtin __syncthreads() functions, assuming this code is in a header file included by all kernel functions.
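To illustrate what the tripwire is meant to catch, here is a hypothetical kernel (not from my real applications) containing a deliberately divergent barrier. With USE_MY_SYNCTHREADS enabled, the assert should fire because only some threads in the block increment syncCount:

```cuda
// Hypothetical kernel with a deliberately divergent barrier: only the
// first 16 threads of each block reach the __syncthreads() call,
// which breaks the all-or-none rule.
__global__ void divergentKernel(int *out)
{
    if (threadIdx.x < 16)
        __syncthreads(); // with the macro active, this expands to
                         // assert(mySyncthreads() == blockDim.x),
                         // which should fail when blockDim.x > 16

    out[threadIdx.x] = threadIdx.x;
}
```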


A tip - assert(mySyncthreads() == blockDim.x) will fail when there is more than one dimension of threads in a block (in other words, when you have a 2D or 3D block).

And a question - why do you need the first __syncthreads() before the atomicAdd? Wouldn’t it work without it too?


Good point about the 1D threads/block requirement; I’ve been working with a 1D application and didn’t think to generalize the function.
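For what it’s worth, a sketch of the generalized version (same helper as before, just comparing against the full block size rather than blockDim.x alone) might look like:

```cuda
__device__ int mySyncthreads()
{
    __shared__ int syncCount;
    syncCount = 0;
    __syncthreads();           // builtin barrier (macro substitution is non-recursive)
    atomicAdd(&syncCount, 1);
    __syncthreads();
    return syncCount;
}

// Compare against the total number of threads per block so the
// check also works for 2D and 3D blocks:
#define __syncthreads() \
    assert(mySyncthreads() == (int)(blockDim.x * blockDim.y * blockDim.z))
```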

As for the first __syncthreads(), my thought was that it’s required to ensure that no thread can reset syncCount to 0 after another thread has already executed its atomic add. Without the barrier, this could happen if one warp ran ahead, executing both the assignment and the increment, before a different warp had executed its copy of the assignment.
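Concretely, one interleaving that the first barrier rules out (warp numbering is just for illustration) would be:

```cuda
// Without the first __syncthreads(), this ordering is possible:
//
//   warp 0: syncCount = 0;
//   warp 0: atomicAdd(&syncCount, 1);  // syncCount == 1
//   warp 1: syncCount = 0;             // wipes out warp 0's increment
//   warp 1: atomicAdd(&syncCount, 1);  // syncCount == 1, not 2
//
// The final count would then undercount the threads in the block,
// and the assert could fire even in a kernel that uses its barriers
// correctly.
```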