Custom __syncthreads() with error detection?

I’ve noticed that when __syncthreads() rules are broken (e.g., either all threads or no threads must reach every __syncthreads() call), program behavior becomes undefined. This can manifest itself as threads “breaking through” __syncthreads() barriers; that is, some threads execute code that is after a barrier before other threads have reached the barrier.

To try and detect when __syncthreads() errors might exist in a program, I’ve written a simple macro and helper function that should act like a tripwire to detect when program behavior has become undefined in the way I just mentioned. I’d appreciate any feedback on it-- is it correct? Is it useful? I’ve tested it in a few applications and it seems to do what I want, but I could use a second opinion.

Here it is:

#define USE_MY_SYNCTHREADS 1 // toggle between error-detecting and builtin __syncthreads()

device int mySyncthreads()
{
shared int syncCount;
syncCount = 0;

__syncthreads();

atomicAdd(&syncCount, 1);

__syncthreads();

return syncCount;
}

#if USE_MY_SYNCTHREADS
#define __syncthreads() assert(mySyncthreads() == blockDim.x)
#endif

The mySyncthreads() function creates a shared int, initializes it to zero, and allows each thread to atomically increment that int before returning its value. The macro redefines all __syncthreads() calls in the program to call mySyncthreads() and fail if the value returned does not equal the number of threads per block. Because macro substitutions are non-recursive, the __syncthreads() calls inside the mySyncthreads() function use the builtin CUDA __syncthreads() function rather than the macro-ized one (if this were not the case, the compiler would probably get stuck in an infinite loop). The USE_MY_SYNCTHREADS definition allows toggling between the custom and builtin __syncthreads() functions, assuming this code is in a header file included by all kernel functions.

Thanks!

A tip - assert(mySyncthreads() == blockDim.x) will fail when there is more then one dimention of threads in block (in other words when You have 2D or 3D block).

And a question - why do You need the first __syncthreads() before atomicAdd? Wouldn’t it work without it too?

MK

Good point about the 1D threads/block requirement; I’ve been working with a 1D application and didn’t think to generalize the function.

As for the first __syncthreads(), my thought was that it’s required to ensure that no thread can set the value of syncCount to 0 after another thread has already executed the atomic add instruction. This would happen if all warps executed the instruction to create syncCount, but then one warp executed until after it incremented syncCount before a different warp could execute the assignment.