Cooperative Group Grid synchronization leading to execution freezes

Hello,

I am attempting to fuse a series of small kernels into one larger kernel, and I planned to use the cooperative groups grid synchronization mechanism to order the reads and writes to global memory between phases. The kernel passes the test suite; however, when I run it with real-world data, the program seems to lock up. I was able to narrow the hang down to the call that syncs the grid.

Here is what my code looks like:

extern __shared__ data shared[];
const int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= array_length) return;

cg::grid_group grid_group = cg::this_grid();   
shared[threadIdx.x] = global[tid];
doSomeWork(shared);
global[tid] = shared[threadIdx.x];
grid_group.sync();
doSomeMoreWork(global);

The freeze doesn’t occur on every kernel invocation, but it appears to be deterministic: it consistently freezes at the same point. I am invoking the kernel through cudaLaunchCooperativeKernel() with 256 threads per block and an average of 10 blocks. One fix I tried was to ensure that the number of blocks launched matches the number of blocks per SM that my device can support, as reported by cudaOccupancyMaxActiveBlocksPerMultiprocessor, though this only led to the kernel freezing during the unit tests as well.

I’m running CUDA 12.2 on an RTX 4090 with driver version 550.67.

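All threads in the grid should call grid_group.sync(). In your kernel, threads with tid >= array_length return before reaching it, so the barrier can wait forever whenever array_length is not a multiple of the block size. Does this work?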

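extern __shared__ data shared[];
cg::grid_group grid_group = cg::this_grid();
const int tid = blockIdx.x * blockDim.x + threadIdx.x;

// Guard the work instead of returning early, so out-of-range threads still reach the barrier.
if (tid < array_length) {
    shared[threadIdx.x] = global[tid];
    doSomeWork(shared);
    global[tid] = shared[threadIdx.x];
}

// Every thread in the grid calls sync(), including those with tid >= array_length.
grid_group.sync();

if (tid < array_length) {
    doSomeMoreWork(global);
}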
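
In case a complete example helps, here is a minimal sketch of the same pattern together with a host-side launch. It is not your actual code: fusedKernel, the in/tmp/out buffers, and the arithmetic are made up for illustration, and the array length of 2500 is chosen so that the last block has out-of-range threads. The residency check is my assumption about how you might guard the launch, since a cooperative launch needs all blocks to be resident at once.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical two-phase kernel. Phase 2 reads a value written by another
// thread (possibly in another block) in phase 1, so it needs a grid-wide barrier.
__global__ void fusedKernel(const float* in, float* tmp, float* out, int n)
{
    cg::grid_group grid = cg::this_grid();
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < n) {
        tmp[tid] = 2.0f * in[tid];                  // phase 1
    }

    grid.sync();                                    // no early return above, so every thread gets here

    if (tid < n) {
        out[tid] = tmp[tid] + tmp[(tid + 1) % n];   // phase 2
    }
}

int main()
{
    int n = 2500;                                   // not a multiple of the block size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads; // 10 blocks

    // Check cooperative launch support and that the whole grid can be resident.
    int dev = 0, supportsCoop = 0, numSms = 0, blocksPerSm = 0;
    cudaSetDevice(dev);
    cudaDeviceGetAttribute(&supportsCoop, cudaDevAttrCooperativeLaunch, dev);
    cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, fusedKernel, threads, 0);
    if (!supportsCoop || blocks > blocksPerSm * numSms) {
        std::printf("grid cannot be launched cooperatively on this device\n");
        return 1;
    }

    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_tmp = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_tmp, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel arguments are passed as an array of pointers to each argument.
    void* args[] = { &d_in, &d_tmp, &d_out, &n };
    cudaError_t err = cudaLaunchCooperativeKernel((void*)fusedKernel, dim3(blocks),
                                                  dim3(threads), args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Verify against the same computation done on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        const float expected = 2.0f * h_in[i] + 2.0f * h_in[(i + 1) % n];
        if (h_out[i] != expected) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_tmp);
    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

On your setup something like nvcc -arch=sm_89 coop.cu should build it (the file name is arbitrary, and older toolkits may also need -rdc=true). If you put the early return back in front of grid.sync(), I would expect this to be able to hang in the same way as your kernel.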

Thank you!
