Cooperative Group Grid synchronization leading to execution freezes

chandler.w.carlin · June 21, 2024, 8:05pm

Hello,

I am attempting to convert a series of small kernels into one larger kernel, and I was planning on using the cooperative groups grid synchronization mechanism to ensure proper reads/writes to global memory. The kernel passes the test suite; however, when I run it with real-world data, the program seems to lock up. I was able to narrow it down to the call that syncs the grid.

Here is what my code looks like:

extern __shared__ data shared[];
const int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= array_length) return;

cg::grid_group grid_group = cg::this_grid();   
shared[threadIdx.x] = global[tid];
doSomeWork(shared);
global[tid] = shared[threadIdx.x];
grid_group.sync();
doSomeMoreWork(global);

The freeze doesn’t occur on every kernel invocation, but it seems to be deterministic, given that it freezes at the same point consistently. I am invoking the kernel through cudaLaunchCooperativeKernel() with 256 threads per block and an average of 10 blocks. One fix I tried was to ensure that the number of blocks launched was guaranteed to align with the number of blocks per SM that my device can support, as given by cudaOccupancyMaxActiveBlocksPerMultiprocessor, though this only led to the kernel freezing during the unit tests.
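
For reference, the occupancy-based limit described above can be checked on the host before launching. This is a minimal sketch, not the original code: the kernel fusedKernel, its float payload, and the array length of 2560 are hypothetical stand-ins. Only cudaOccupancyMaxActiveBlocksPerMultiprocessor, the cooperativeLaunch device property, and cudaLaunchCooperativeKernel are taken from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical stand-in for the fused kernel; only the launch logic matters here.
__global__ void fusedKernel(float* global, int array_length)
{
    extern __shared__ float shared[];
    cg::grid_group grid = cg::this_grid();
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < array_length) {
        shared[threadIdx.x] = global[tid] + 1.0f;
        global[tid] = shared[threadIdx.x];
    }
    grid.sync();  // reached by every thread in the grid
}

int main()
{
    const int threadsPerBlock = 256;
    const size_t sharedBytes = threadsPerBlock * sizeof(float);
    int array_length = 2560;  // assumed size, about 10 blocks

    const int device = 0;
    cudaSetDevice(device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    if (!prop.cooperativeLaunch) {
        std::printf("device does not support cooperative launches\n");
        return 1;
    }

    // Blocks of this kernel that fit on one SM, for this block size and
    // this amount of dynamic shared memory.
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, fusedKernel, threadsPerBlock, sharedBytes);

    // Grid-wide limit: all blocks must be resident on the device at once.
    const int maxCoopBlocks = blocksPerSm * prop.multiProcessorCount;
    const int numBlocks = (array_length + threadsPerBlock - 1) / threadsPerBlock;
    if (numBlocks > maxCoopBlocks) {
        std::printf("%d blocks requested, only %d can be co-resident\n",
                    numBlocks, maxCoopBlocks);
        return 1;
    }

    float* d_global = nullptr;
    cudaMalloc(&d_global, array_length * sizeof(float));
    cudaMemset(d_global, 0, array_length * sizeof(float));

    void* args[] = { &d_global, &array_length };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void*)fusedKernel, dim3(numBlocks), dim3(threadsPerBlock),
        args, sharedBytes, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    std::printf("launch status: %s\n", cudaGetErrorString(err));

    cudaFree(d_global);
    return err == cudaSuccess ? 0 : 1;
}
```

Note that the occupancy call returns a per-SM figure, so it has to be multiplied by the SM count to get the grid-wide limit. As far as I know, a cooperative launch that exceeds this limit is reported as a launch error rather than showing up as a hang, which is why checking the return value is worthwhile.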

I’m running CUDA 12.2 on an RTX 4090 with driver version 550.67.

striker159 · June 22, 2024, 6:37am

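All threads in the grid must call grid_group.sync(). In your code, threads with tid >= array_length return before reaching it, so the remaining threads can wait at the barrier indefinitely. Does this work?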

extern __shared__ data shared[];
cg::grid_group grid_group = cg::this_grid();
const int tid = blockIdx.x * blockDim.x + threadIdx.x;

// Guard the work instead of returning early, so no thread skips the barrier.
if (tid < array_length) {
    shared[threadIdx.x] = global[tid];
    doSomeWork(shared);
    global[tid] = shared[threadIdx.x];
}

// Every thread in the grid reaches this call.
grid_group.sync();

if (tid < array_length) {
    doSomeMoreWork(global);
}
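
To try the guarded pattern above end to end, here is a small self-contained sketch. It makes several assumptions that are not in the original post: the element type is int, doSomeWork is replaced by a doubling step, doSomeMoreWork is replaced by a read of an element written by a different block, and the array length of 2500 is chosen so that the last block contains threads with tid >= array_length. The names twoPhase, d_out, and n are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__global__ void twoPhase(int* global, int* out, int array_length)
{
    extern __shared__ int shared[];
    cg::grid_group grid = cg::this_grid();
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const bool active = tid < array_length;

    // Phase 1: stage in shared memory, modify, write back (stand-in for doSomeWork).
    if (active) {
        shared[threadIdx.x] = global[tid];
        shared[threadIdx.x] *= 2;
        global[tid] = shared[threadIdx.x];
    }

    // Out-of-range threads skip the work but still take part in the barrier.
    grid.sync();

    // Phase 2: read a value written by another block (stand-in for doSomeMoreWork).
    if (active) {
        const int neighbour = (tid + blockDim.x) % array_length;
        out[tid] = global[neighbour];
    }
}

int main()
{
    const int threads = 256;
    int n = 2500;  // not a multiple of 256, so the last block has idle threads
    const int blocks = (n + threads - 1) / threads;

    std::vector<int> host(n);
    for (int i = 0; i < n; ++i) host[i] = i;

    int* d_global = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_global, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_global, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    void* args[] = { &d_global, &d_out, &n };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void*)twoPhase, dim3(blocks), dim3(threads),
        args, threads * sizeof(int), 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Each output element should equal twice the original neighbour value.
    cudaMemcpy(host.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (host[i] != 2 * ((i + threads) % n)) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(d_global);
    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

Something like nvcc -arch=sm_89 should build this for an RTX 4090; older toolkits may additionally need -rdc=true for grid synchronization. Moving the early return back in front of grid.sync() would reproduce the situation from the question, where some threads never arrive at the barrier.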

brian.a.paden · June 24, 2024, 5:01pm

Thank you!

