Possible race in CUDA Cooperative Groups

I think I’ve noticed a race in the Cooperative Groups implementation of grid-level synchronization and wanted to confirm this with NVIDIA developers.

In /include/cooperative_groups/details/sync.h the function sync_grids performs this task.
Below is that function:

_CG_STATIC_QUALIFIER void sync_grids(unsigned int expected, volatile unsigned int *arrived) {
  bool cta_master = (threadIdx.x + threadIdx.y + threadIdx.z == 0);
  bool gpu_master = (blockIdx.x + blockIdx.y + blockIdx.z == 0);

  __barrier_sync(0);

  if (cta_master) {
    unsigned int nb = 1;
    if (gpu_master) {
      nb = 0x80000000 - (expected - 1);
    }

    __threadfence();

    unsigned int oldArrive;
    oldArrive = atomic_add(arrived, nb);

    while (!bar_has_flipped(oldArrive, *arrived));

    // flush barrier upon leaving
    bar_flush((unsigned int*)arrived);
  }

  __barrier_sync(0);
}

One thing that stands out is that only one thread per block (the CTA master) performs any form of memory fence; the other threads only execute the barrier.

From the CUDA programming guide:

"void __threadfence() acts as __threadfence_block() for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to __threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to __threadfence()."

Since the other threads in the block never call a fence operation, then per the CUDA guide there are no device-scope memory consistency guarantees for those threads' prior memory operations.

I think the __threadfence() inside the cta_master branch should be moved outside of it, so that every thread fences and the guarantee holds formally.
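Concretely, the rearrangement I have in mind looks like this (my sketch, not NVIDIA's code; helper names as in sync.h):

```cuda
_CG_STATIC_QUALIFIER void sync_grids(unsigned int expected, volatile unsigned int *arrived) {
  bool cta_master = (threadIdx.x + threadIdx.y + threadIdx.z == 0);
  bool gpu_master = (blockIdx.x + blockIdx.y + blockIdx.z == 0);

  __barrier_sync(0);

  // Moved out of the cta_master branch: EVERY thread now fences, so each
  // thread's prior writes are ordered at device scope before its block
  // is reported as arrived.
  __threadfence();

  if (cta_master) {
    unsigned int nb = 1;
    if (gpu_master) {
      nb = 0x80000000 - (expected - 1);
    }
    unsigned int oldArrive = atomic_add(arrived, nb);
    while (!bar_has_flipped(oldArrive, *arrived));
    bar_flush((unsigned int*)arrived);
  }

  __barrier_sync(0);
}
```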

I wanted to confirm whether this was indeed a race, as I’ve described.

Does the documentation specify that grid::sync acts as a memory fence on the device?

It is the case for __syncthreads() (block-wide) and __syncwarp() (warp-wide), and according to a source code comment, thread_block::sync() is equivalent to __syncthreads().
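The block-level case is easy to illustrate (a minimal sketch; the kernel and buffer names are my own):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// After block.sync() -- documented as equivalent to __syncthreads() --
// each thread can safely read a value written by a neighboring thread.
__global__ void neighbor_read(int *buf, int *out) {
  cg::thread_block block = cg::this_thread_block();
  buf[threadIdx.x] = (int)threadIdx.x;
  block.sync();  // barrier + block-wide memory fence
  out[threadIdx.x] = buf[(threadIdx.x + 1) % blockDim.x];
}
```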

The documentation strongly implies that it should act as a memory fence.
Quoting the explanation for grid-level synchronization:

"Prior to the introduction of Cooperative Groups, the CUDA programming model only allowed synchronization between thread blocks at a kernel completion boundary. The kernel boundary carries with it an implicit invalidation of state, and with it, potential performance implications.

For example, in certain use cases, applications have a large number of small kernels, with each kernel representing a stage in a processing pipeline. The presence of these kernels is required by the current CUDA programming model to ensure that the thread blocks operating on one pipeline stage have produced data before the thread block operating on the next pipeline stage is ready to consume it. In such cases, the ability to provide global inter thread block synchronization would allow the application to be restructured to have persistent thread blocks, which are able to synchronize on the device when a given stage is complete."

If device-scope memory fence semantics are not guaranteed, this transfer of data between stages would also not be guaranteed.
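Assuming the fence semantics do hold, the restructuring the guide describes would look roughly like this (a sketch only; the stage functions are hypothetical placeholders):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical stage functions, for illustration only.
__device__ float stage1(float x) { return x * 2.0f; }
__device__ float stage2(float x) { return x + 1.0f; }

// Persistent-blocks restructuring: each grid.sync() stands in for a kernel
// completion boundary. Stage 2 consumes values that other blocks produced
// in stage 1, so the sync must make those writes visible device-wide.
__global__ void pipeline(float *data, float *out, int n) {
  cg::grid_group grid = cg::this_grid();
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  if (i < n) data[i] = stage1(data[i]);          // stage 1 produces
  grid.sync();                                   // replaces a kernel boundary
  if (i < n) out[i] = stage2(data[n - 1 - i]);   // stage 2 consumes another block's output
}
```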

I noticed that the CUDA sample conjugateGradientMultiBlockCG makes use of this grid-level synchronization.

However, if memory fence semantics are not guaranteed by grid::sync, then this program has data races: reads and writes to the same location occur from different threads with no other synchronization mechanism in between.
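The pattern in question reduces to something like the following (a simplified sketch of the sample's structure, not its actual code):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// One thread writes a location, grid.sync() is the only synchronization,
// and a thread in a different block reads it. This is race-free only if
// grid.sync() carries device-scope memory fence semantics.
__global__ void handoff(volatile double *cell, double *out) {
  cg::grid_group grid = cg::this_grid();
  if (grid.thread_rank() == 0)
    *cell = 1.0;                 // write in block 0
  grid.sync();                   // the only synchronization in between
  if (blockIdx.x == gridDim.x - 1 && threadIdx.x == 0)
    *out = *cell;                // read in the last block
}
```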

The observation about the samples should be instructive: they are intended to be representative of proper programming practice.

I’ve filed an internal bug at NVIDIA to have the documentation enhanced with respect to this. I don’t know when it will be acted upon. I’m unlikely to be able to respond to further inquiries about this.