CUDA graphs issue when updating kernel node dynamic shared memory size - cooperative group synchronization

I’m running an application on image stacks of different shapes (H, W, D). I create a CUDA graph only once per run for these different inputs, using cudaGraphCreate and cudaGraphInstantiate, where I add the graph nodes and the kernel parameters.

There is an issue when I try to update the dynamic shared memory size in cudaKernelNodeParams. Inside the kernel that uses the dynamic shared memory, a cooperative group gets stuck in synchronization, specifically at this call stack:

>	[CUDA]kernels.obj !ld_acquire_cta Line 160 [0x0000001300cd48e0]	
 	[CUDA]kernels.obj !barrier_wait Line 177 [0x0000001300cd4920]	
 	[CUDA]kernels.obj !sync_warps Line 195 [0x0000001300cd4980]	
 	[CUDA]kernels.obj !sync Line 1458 [0x0000001300cd4a10]	
 	[CUDA]kernels.obj !operator() Line 339 [0x0000001300cd4a40]	

From within the sync.h

// Read the barrier, acquire to ensure all memory operations following the sync are correctly performed after it is released
_CG_STATIC_QUALIFIER unsigned int ld_acquire_cta(unsigned int *addr) {
    unsigned int val;
NV_IF_ELSE_TARGET(NV_PROVIDES_SM_70,
    (asm volatile("ld.acquire.cta.u32 %0,[%1];" : "=r"(val) : _CG_ASM_PTR_CONSTRAINT(addr) : "memory");)
    ,
    (val = *((volatile unsigned int*) addr);
    __threadfence_block();)
    );
    return val;
}

It’s worth noting the following behavior:

  1. The application works fine when I run only 1 image stack, and the result is correct, even for the stack that previously got stuck.
  2. When the application runs multiple image stacks through the same pipeline, the issue doesn’t happen if I allocate a larger amount of dynamic shared memory than needed, i.e. 16 KB. The results per image stack were also correct.
  3. The issue persists if I try to change the dynamic shmem size after the 1st image stack.
  4. compute-sanitizer doesn’t report any memcheck errors or possible race conditions that could affect this synchronization, since it literally happens at the beginning of the kernel. It also gets stuck under the same runtime conditions.

I always do the following after updating the kernel node parameters:

  1. cudaGraphExecKernelNodeSetParams
  2. cudaGraphLaunch

CUDA version:

Built on Wed_Jan_15_19:38:46_Pacific_Standard_Time_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0

GPU:

Device 0: NVIDIA RTX 2000 Ada Generation
Compute Capability:           8.9
Multiprocessors (SMs):        22

Total Global Memory:          16379 MB
Shared Memory per Block:      48 KB
Shared Memory per SM:         100 KB
L2 Cache Size:                24576 KB
Registers per Block:          65536
Registers per SM:             65536

Is there a restriction that prevents changing the dynamic shared memory size once the graph has finished execution, especially with a kernel that uses cooperative groups?

Does the kernel work without cuda graph?

Do all threads in the synchronizing group call group.sync() or are there some threads which do not?

If by cooperative group synchronization you mean grid-level synchronization, there is one issue that I can think of. For grid-level synchronization one has to make sure that all thread blocks fit on the SMs at the same time. Changing the dynamic shared memory size can alter the maximum number of resident thread blocks, which can lead to a deadlock if it is less than the number of launched blocks. You might have to update both the shared memory size and the number of blocks.
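One way to check this (a sketch, not from the thread; `myKernel` is a placeholder standing in for the real kernel) is to query the occupancy for the new dynamic shared memory size before updating the node:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute the kernel whose node is being updated.
__global__ void myKernel() {}

int main() {
    int device = 0, numSMs = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

    int blockSize = 128;              // block size used in the graph node
    size_t dynamicSmem = 16 * 1024;   // the new size you intend to set

    // Max thread blocks of this kernel that can be resident per SM
    // with the given block size and dynamic shared memory request.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernel, blockSize, dynamicSmem);

    // For a grid-wide sync, the launched grid must not exceed this.
    printf("max co-resident blocks: %d\n", blocksPerSM * numSMs);
    return 0;
}
```

If a shared memory update lowers `blocksPerSM` so that the grid no longer fits co-resident, a grid-sync launch that previously worked can deadlock.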

Does the kernel work without cuda graph?

Yes, and it also works with a CUDA graph as long as the dynamic shared memory size is never updated or changed.

Do all threads in the synchronizing group call group.sync() or are there some threads which do not?

Not sure how to check that. I tried disabling the entire functionality inside the kernel and attached the compute-sanitizer output in this reply.

If by cooperative group synchronization you mean grid-level synchronization

The kernel never does a grid-level synchronization; it only uses:

auto tile64 = cg::tiled_partition<64>(currentThreadBlock);
tile64.sync();

Changing the dynamic shared memory size can alter the maximum number of resident threads blocks, which can lead to a dead lock if it is less than the number of launched blocks

I do launch many blocks, but each block has only 2 warps in the current design. What is the upper limit on the number of blocks launched that comes from the relationship between the block count and the shmem size needed per block? The shmem size never exceeds even half of what is available per block according to cudaGetDeviceProperties.

The number of blocks fits within the maximum grid size in all directions:

----- Block & Grid Limits -----
Max Block Dimensions:         (1024, 1024, 64)
Max Grid Dimensions:          (2147483647, 65535, 65535)

Updated output from compute-sanitizer running memcheck after disabling the entire functionality inside the kernel and leaving only the cooperative group partitioning:

========= Invalid shared atomic of size 4 bytes
=========     at cooperative_groups::__v1::__static_size_multi_warp_tile_base<(unsigned int)64>::__static_size_multi_warp_tile_base<cooperative_groups::__v1::thread_block>(const T1 &) [subobject]+0x17b0
=========     by thread (0,32,0) in block (62,0,0)
=========     Address 0x2a80 is out of bounds
=========         Device Frame: cooperative_groups::__v1::__multi_warp_thread_block_tile<(unsigned int)64, cooperative_groups::__v1::thread_block>::__multi_warp_thread_block_tile(const cooperative_groups::__v1::thread_block &) [subobject]+0x390
=========         Device Frame: cooperative_groups::__v1::details::thread_block_tile_impl<(unsigned int)64, cooperative_groups::__v1::thread_block, (bool)1>::thread_block_tile_impl<cooperative_groups::__v1::thread_block>(const T1 &) [subobject]+0x390
=========         Device Frame: cooperative_groups::__v1::thread_block_tile<(unsigned int)64, cooperative_groups::__v1::thread_block>::thread_block_tile(const cooperative_groups::__v1::thread_block &) [subobject]+0x390
=========         Device Frame: tiled_partition_impl+0x3510 in cooperative_groups.h:1597
=========         Device Frame: tiled_partition<64U,thread_block>+0x3520 in cooperative_groups.h:1616
=========         Device Frame: operator()+0x3590 in BlockXPixelPipeline.hpp:151
=========         Device Frame: TravHWC(unsigned int, const unsigned short *, const unsigned long long *, const unsigned long long *, const double *, const double *, const double *, int *, int *, unsigned char *, unsigned int, unsigned int, unsigned int, unsigned int, double, unsigned int, unsigned int, TravInfo)+0xfee0 in kernels.cu:146
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: cudaGraphLaunch [0x491bd] in FittingDLL.dll
=========         Host Frame: H_Stream::operator() in H_Stream.cu:132 [0x2a2e2] in FittingDLL.dll
=========         Host Frame: H_GraphManager::operator() in H_GraphManager.cu:250 [0x2d02b] in FittingDLL.dll
=========         Host Frame: GPU_Execute in exportDLL.cu:150 [0x2f2a4] in FittingDLL.dll
=========         Host Frame:  [0x7ff7d57ef097] in

The number of blocks is only relevant for grid sync. You do not need to worry about it if only thread block tiles are used.

Are you able to share the code which produces the invalid shared atomic?

Are you able to share the code which produces the invalid shared atomic?

Unfortunately, I can’t. However, the pseudocode flows as follows:

  1. Create and instantiate the graph
  2. Update kernel node parameters (including the shmem size)
  3. Launch the graph for execution
  4. The kernel only divides the threads in the block into groups of 64 threads, then further into groups of 32 threads. (This works fine for the 1st image stack but fails for the next one.)
  5. Once the application finishes execution, update the shmem size again to be greater/lower than the previous size for the same graph. (This gave the output from compute-sanitizer.)
  6. Failure in cg::tiled_partition<64>(currentThreadBlock);

At the moment the kernel only has the partitioning into 64-thread tiles and then further partitioning into 2 warps, then synchronizes on the 64-thread tiles.

My launch parameters:

  • block dimension (1, 64, 1) - hard configured for the moment and never changes
  • shmem size ranges from 2 KB to 16 KB maximum.

I managed to make a version I can share that replicates the issue by changing the shmem size:

main.txt (2.6 KB)

and here is the code for quick scanning:

#include <assert.h>
#include <climits>
#include <stdio.h>
#include <vector>

#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
       fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);

       if (abort) exit(code);
   }
}

__global__ void squareArray(uint32_t s)
{
    extern __shared__ int shared_data[];
    cg::thread_block currentThreadBlock = cg::this_thread_block();
    auto tile64 = cg::tiled_partition<64>(currentThreadBlock);

    if (tile64.thread_rank() == 0) {
        shared_data[0] = s;
    }

    tile64.sync();

    auto warpTile32 = cg::tiled_partition<32>(currentThreadBlock);
    warpTile32.sync();
    tile64.sync();

}

int main(int argc, char **argv)
{
    cudaStream_t    stream;
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaGraphNode_t squareKernelNode;
    cudaKernelNodeParams kernelNodeParams = {0};
    uint32_t s = 123;
    void *squareKernelArgs[1] = {&s};

    gpuErrchk(cudaGraphCreate(&graph, 0));
    gpuErrchk(cudaStreamCreate(&stream));

    kernelNodeParams.func = (void *)squareArray;
    kernelNodeParams.gridDim = dim3(1024, 1, 1);
    kernelNodeParams.blockDim = dim3(128, 1, 1);
    kernelNodeParams.sharedMemBytes = 2048;
    kernelNodeParams.kernelParams = (void **)squareKernelArgs;

    gpuErrchk(cudaGraphAddKernelNode(&squareKernelNode, graph, nullptr, 0, &kernelNodeParams));
    gpuErrchk(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0));

    gpuErrchk(cudaGraphExecKernelNodeSetParams(graphExec, squareKernelNode, &kernelNodeParams));
    gpuErrchk(cudaGraphLaunch(graphExec, stream));
    gpuErrchk(cudaStreamSynchronize(stream));

    // Second launch (Updated shared memory)
    kernelNodeParams.gridDim = dim3(1024, 1, 1);
    kernelNodeParams.sharedMemBytes = 1024;     // FIX: REPLACE WITH 2048 - same as the previous graph launch
    gpuErrchk(cudaGraphExecKernelNodeSetParams(graphExec, squareKernelNode, &kernelNodeParams));
    gpuErrchk(cudaGraphLaunch(graphExec, stream));
    gpuErrchk(cudaStreamSynchronize(stream));

    gpuErrchk(cudaStreamDestroy(stream));
    gpuErrchk(cudaGraphExecDestroy(graphExec));

    printf("Graph executions completed.\n");
    return 0;
}

Compute sanitizer error:

========= Invalid shared atomic of size 4 bytes
=========     at cooperative_groups::__v1::details::red_and_release_cta(unsigned int *, unsigned int)+0x220 in sync.h:149
=========     by thread (96,0,0) in block (47,0,0)
=========     Address 0x800 is out of bounds
=========         Device Frame: cooperative_groups::__v1::details::sync_warps_reset(unsigned int *, unsigned int)+0x1c0 in sync.h:270
=========         Device Frame: cooperative_groups::__v1::__static_size_multi_warp_tile_base<(unsigned int)64>::__static_size_multi_warp_tile_base<cooperative_groups::__v1::thread_block>(const T1 &)+0x30 in cooperative_groups.h:1365
=========         Device Frame: cooperative_groups::__v1::__multi_warp_thread_block_tile<(unsigned int)64, cooperative_groups::__v1::thread_block>::__multi_warp_thread_block_tile(const cooperative_groups::__v1::thread_block &)+0x10 in cooperative_groups.h:1494
=========         Device Frame: cooperative_groups::__v1::details::thread_block_tile_impl<(unsigned int)64, cooperative_groups::__v1::thread_block, (bool)1>::thread_block_tile_impl<cooperative_groups::__v1::thread_block>(const T1 &)+0x10 in cooperative_groups.h:1545
=========         Device Frame: cooperative_groups::__v1::thread_block_tile<(unsigned int)64, cooperative_groups::__v1::thread_block>::thread_block_tile(const cooperative_groups::__v1::thread_block &)+0x10 in cooperative_groups.h:1565
=========         Device Frame: cooperative_groups::__v1::details::tiled_partition_impl<(unsigned int)64, cooperative_groups::__v1::thread_block>::tiled_partition_impl(const cooperative_groups::__v1::thread_block &)+0x10 in cooperative_groups.h:1597
=========         Device Frame: cooperative_groups::__v1::thread_block_tile<T1, T2> cooperative_groups::__v1::tiled_partition<(unsigned int)64, cooperative_groups::__v1::thread_block>(const T2 &)+0x10 in cooperative_groups.h:1616
=========         Device Frame: squareArray(unsigned int)+0x10 in main.cu:29
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: cudaGraphLaunch [0x4847d] in Fitting.exe
=========         Host Frame: main in main.cu:73 [0x352ec] in Fitting.exe

Graph executions completed.
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 4096 errors