Possible race in CUDA Cooperative Groups

I think I’ve noticed a race in the Cooperative Groups implementation of grid-level synchronization and wanted to confirm this with NVIDIA developers.

In /include/cooperative_groups/details/sync.h, the function sync_grids implements this grid-wide barrier.
Here is that function:

_CG_STATIC_QUALIFIER void sync_grids(unsigned int expected, volatile unsigned int *arrived) {
  bool cta_master = (threadIdx.x + threadIdx.y + threadIdx.z == 0);
  bool gpu_master = (blockIdx.x + blockIdx.y + blockIdx.z == 0);

  __syncthreads();

  if (cta_master) {
    unsigned int nb = 1;
    if (gpu_master) {
        nb = 0x80000000 - (expected - 1);
    }
    __threadfence();
    unsigned int oldArrive;
    oldArrive = atomic_add(arrived, nb);
    while (!bar_has_flipped(oldArrive, *arrived));
    //flush barrier upon leaving
    bar_flush((unsigned int*)arrived);
  }
  __syncthreads();
}
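
For what it's worth, the arrival arithmetic itself seems clear to me: with expected equal to the number of blocks in the grid, the gpu_master adds 0x80000000 - (expected - 1) and each of the other expected - 1 block masters adds 1, so one full round adds exactly 0x80000000 to *arrived. For example, with expected = 4 the gpu_master adds 0x80000000 - 3 and the three remaining masters add 1 each, so the high bit of the counter flips precisely when the last block arrives, which is presumably what bar_has_flipped tests oldArrive against. My question is only about the memory ordering around this counter, not the counting itself.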

One thing that stands out is that only one thread per block (the cta_master) performs any kind of memory fence; the other threads in the block only execute the __syncthreads() calls.

From the CUDA programming guide:

"void __threadfence() acts as __threadfence_block() for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to __threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to __threadfence()."

Since the other threads in the block never call a fence, the guide, as quoted, gives no device-scope memory consistency guarantee for their earlier memory operations.
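
To make the concern concrete, here is a minimal sketch of the pattern I have in mind (my own illustration, not library or sample code; the kernel is assumed to be launched with cudaLaunchCooperativeKernel, and data/out are hypothetical buffers of at least n elements):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void handoff(int *data, int *out, int n) {
  cg::grid_group grid = cg::this_grid();
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  // This store is made by an ordinary thread; inside sync_grids only its
  // cta_master later executes __threadfence().
  if (i < n)
    data[i] = 2 * i;

  grid.sync(); // the hand-off relies on this ordering the store device-wide

  // A thread in a different block reads the value stored before the sync.
  int j = (i + blockDim.x) % n;
  if (i < n)
    out[i] = data[j];
}

Whether the read of data[j] is guaranteed to observe the store made before grid.sync() is exactly what I am asking about.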

I think the __threadfence() inside the if (cta_master) block should be moved outside of it, so that this ordering is formally guaranteed for every thread.
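
Concretely, the change I have in mind would look like this (my own edit of the function quoted above, not the shipped header):

_CG_STATIC_QUALIFIER void sync_grids(unsigned int expected, volatile unsigned int *arrived) {
  bool cta_master = (threadIdx.x + threadIdx.y + threadIdx.z == 0);
  bool gpu_master = (blockIdx.x + blockIdx.y + blockIdx.z == 0);

  __syncthreads();

  // moved out of the if block so every thread orders its own prior writes
  __threadfence();

  if (cta_master) {
    unsigned int nb = 1;
    if (gpu_master) {
      nb = 0x80000000 - (expected - 1);
    }
    unsigned int oldArrive;
    oldArrive = atomic_add(arrived, nb);
    while (!bar_has_flipped(oldArrive, *arrived));
    //flush barrier upon leaving
    bar_flush((unsigned int*)arrived);
  }
  __syncthreads();
}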

I wanted to confirm whether this was indeed a race, as I’ve described.

Does the documentation specify that grid::sync acts as a memory fence on the device?

It is the case for __syncthreads() (block-wide) and __syncwarp() (warp-wide), and according to a source code comment, thread_block::sync is equivalent to __syncthreads().

The documentation strongly implies that it should act as a memory fence.
Quoting the explanation for grid-level synchronization:

"Prior to the introduction of Cooperative Groups, the CUDA programming model only allowed synchronization between thread blocks at a kernel completion boundary. The kernel boundary carries with it an implicit invalidation of state, and with it, potential performance implications.

For example, in certain use cases, applications have a large number of small kernels, with each kernel representing a stage in a processing pipeline. The presence of these kernels is required by the current CUDA programming model to ensure that the thread blocks operating on one pipeline stage have produced data before the thread block operating on the next pipeline stage is ready to consume it. In such cases, the ability to provide global inter thread block synchronization would allow the application to be restructured to have persistent thread blocks, which are able to synchronize on the device when a given stage is complete."

If device-scope memory fence semantics are not guaranteed, this transfer of data between stages would also not be guaranteed.
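
As a rough sketch of that restructuring (my own illustration, assuming a cooperative launch via cudaLaunchCooperativeKernel; run_stage and the ping-pong buffers bufA/bufB are hypothetical):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical per-element stage function, only here to keep the sketch self-contained.
__device__ void run_stage(int stage, const float *in, float *out, int i) {
  out[i] = in[i] + stage; // placeholder work
}

__global__ void persistent_pipeline(float *bufA, float *bufB, int n, int num_stages) {
  cg::grid_group grid = cg::this_grid();
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;

  for (int stage = 0; stage < num_stages; ++stage) {
    const float *in = (stage % 2 == 0) ? bufA : bufB;
    float *out = (stage % 2 == 0) ? bufB : bufA;

    for (int i = tid; i < n; i += stride)
      run_stage(stage, in, out, i);

    // Replaces a kernel boundary: the writes produced in this stage must be
    // visible to every block before any block starts the next stage.
    grid.sync();
  }
}

Each grid.sync() here stands in for what used to be a kernel boundary, so it has to carry the same "the previous stage's writes are visible" guarantee.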

I noticed that the CUDA sample conjugateGradientMultiBlockCG makes use of this grid-level synchronization.

However, if memory fence semantics were not guaranteed by the grid sync, then this program would have data races: reads and writes to the same locations are made by different threads with no other synchronization mechanism in between.
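
The pattern reduces to something like the following (my own sketch of the idiom, not the sample's actual source; partial is assumed to hold one float per thread in the grid, and the kernel is launched cooperatively):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void grid_reduce(const float *x, float *partial, float *result, int n) {
  cg::grid_group grid = cg::this_grid();
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int nthreads = gridDim.x * blockDim.x;

  // Every thread stores a partial sum; most of these writers are not the
  // cta_master that executes __threadfence() inside sync_grids.
  float local = 0.0f;
  for (int i = tid; i < n; i += nthreads)
    local += x[i];
  partial[tid] = local;

  grid.sync(); // the only synchronization between the stores above and the reads below

  // A single thread then reads every other thread's partial sum.
  if (grid.thread_rank() == 0) {
    float total = 0.0f;
    for (int t = 0; t < nthreads; ++t)
      total += partial[t];
    *result = total;
  }
}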

The observation about the sample codes is instructive: they are intended to be representative of proper programming practice.

I’ve filed an internal bug at NVIDIA to have the documentation enhanced with respect to this. I don’t know when it will be acted upon. I’m unlikely to be able to respond to further inquiries about this.