Grid-level __syncthreads: A Global Barrier

I’m looking for a (potentially) better way to do grid-level synchronization, especially since I don’t know whether the method I devised will actually hang if the number of blocks becomes too large. I don’t care much about speed — it isn’t horrifically slow — I’m just concerned there may be a simpler/better/safer way to do this.

The method I’m using is simple: have every thread call a barrier function that runs __syncthreads at the beginning and end. The primary thread of each block (the first thread, threadIdx == { 0, 0, 0 }) changes a block-specific value stored in global memory from 0 to 1. The master thread of the grid (the primary thread of blockIdx { 0, 0 }) continuously counts the number of blocks that have set this value. Once all the blocks are waiting, it drops the barrier and lets them all through. Any suggestions?

And yes, I’ve seen …t=&#entry279524 and have been told that it relies on assumed hardware behavior that can change.

// Barriers all threads in the grid. For simplicity, the grid and all blocks
// MUST be one-dimensional.
// @param mutexVals An array of integers in global memory of size gridDim.x;
//                  all values must be initially set to zero.
__device__ void localBarrier(volatile unsigned int * mutexVals)
{
  __syncthreads(); // sync up all block threads to this point.

  if (threadIdx.x == 0) // only let the primary block thread in from here.
  {
    mutexVals[blockIdx.x] = 1; // tell the primary grid thread that this block is ready.

    if (blockIdx.x == 0) // only let the master thread through this point.
    {
      int count;
      do
      {
        // count up the number of blocks that are waiting.
        count = 0;
        for (int i = 0; i < gridDim.x; ++i) if (mutexVals[i] == 1) ++count;
      }
      while (count < gridDim.x); // this will be true when all blocks are waiting.

      for (int i = 0; i < gridDim.x; ++i) mutexVals[i] = 0; // let all blocks through the barrier.
    }

    // Keep sitting idle until we're allowed through. It would be nice to have
    // a built-in method to allow this block to be scheduled out, as that would
    // guarantee against starvation and resource hogging by a waiting block.
    while (mutexVals[blockIdx.x] == 1) { }
  }

  __syncthreads(); // sync up all block threads again before returning.
}
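For context, here is a rough sketch of how I use it — the kernel name, sizes, and data layout here are hypothetical placeholders, not my actual code:

```cuda
// Hypothetical two-phase kernel: each block does some work, then waits at
// the grid-wide barrier before reading results produced by other blocks.
__global__ void stepKernel(float * data, volatile unsigned int * mutexVals)
{
    // ... phase 1: each block works on its own slice of data ...

    localBarrier(mutexVals); // no block enters phase 2 until all finish phase 1

    // ... phase 2: safe to read values written by other blocks in phase 1 ...
}

int main()
{
    // NUM_BLOCKS must not exceed the number of blocks that can be resident
    // on the device at once, or the barrier will hang (the open question above).
    const int NUM_BLOCKS = 64, THREADS_PER_BLOCK = 256;

    unsigned int * d_mutexVals;
    cudaMalloc(&d_mutexVals, NUM_BLOCKS * sizeof(unsigned int));
    cudaMemset(d_mutexVals, 0, NUM_BLOCKS * sizeof(unsigned int)); // must start zeroed

    float * d_data;
    cudaMalloc(&d_data, NUM_BLOCKS * THREADS_PER_BLOCK * sizeof(float));

    stepKernel<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(d_data, d_mutexVals);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    cudaFree(d_mutexVals);
    return 0;
}
```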




This might work, but is it really faster than grid-level synchronisation on the host (i.e., calling the kernel multiple times)?
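For reference, the host-side alternative being suggested looks roughly like this (kernel and function names are placeholders) — launches issued to the same stream execute in order, so the kernel boundary itself acts as the grid-wide barrier:

```cuda
// Grid-wide synchronization via kernel re-launch: the implicit ordering of
// kernels on the same stream is the barrier between phases.
__global__ void phase1(float * data) { /* ... work before the barrier ... */ }
__global__ void phase2(float * data) { /* ... work after the barrier ... */ }

void runStep(float * d_data, int numBlocks, int threadsPerBlock)
{
    phase1<<<numBlocks, threadsPerBlock>>>(d_data);
    // phase2 will not start until every block of phase1 has completed,
    // so no in-kernel barrier (and no resident-block limit) is needed.
    phase2<<<numBlocks, threadsPerBlock>>>(d_data);
}
```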

For the purposes of what I’m doing I have persistent blocks, so there is no way to keep relaunching the kernel from the host. I got feedback from some NVIDIA employees stating that persistent blocks can be slower than scheduling thousands of blocks at once, but the paradigm I’m using forces persistent blocks.
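To illustrate why relaunching isn’t an option, the persistent-block pattern looks roughly like this (a sketch with hypothetical names, not my actual kernel): each block stays alive for the whole computation and loops over iterations itself, so there is no kernel boundary to act as a barrier — hence the need for localBarrier inside the loop.

```cuda
// Persistent-block pattern: one long-lived kernel launch; blocks iterate
// internally and synchronize with the software barrier between iterations.
__global__ void persistentKernel(float * data, int numIters,
                                 volatile unsigned int * mutexVals)
{
    for (int iter = 0; iter < numIters; ++iter)
    {
        // ... each block does its share of this iteration's work ...

        localBarrier(mutexVals); // all blocks finish iter before any starts iter+1
    }
}
```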