Cuda graphs issue when updating kernel node dynamic shared memory size - Cooperative group synchronization

I’m running an application with different image stacks of different shapes (H, W, D) and I’m creating a CudaGraph only once per the runtime for these different inputs using cudaGraphCreate and cudaGraphInstantiate where I add the graph nodes and the kernel parameters.

There is an issue when I’m trying to update the dynamic shared memory size in the cudaKernelNodeParams. Inside the same kernel where I need to use the dynamic shared memory, the cooperative group is getting stuck in synchronization. Specifically at this callstack:

>	[CUDA]kernels.obj !ld_acquire_cta Line 160 [0x0000001300cd48e0]	
 	[CUDA]kernels.obj !barrier_wait Line 177 [0x0000001300cd4920]	
 	[CUDA]kernels.obj !sync_warps Line 195 [0x0000001300cd4980]	
 	[CUDA]kernels.obj !sync Line 1458 [0x0000001300cd4a10]	
 	[CUDA]kernels.obj !operator() Line 339 [0x0000001300cd4a40]	

From within the sync.h

// Read the barrier, acquire to ensure all memory operations following the sync are correctly performed after it is released
_CG_STATIC_QUALIFIER unsigned int ld_acquire_cta(unsigned int *addr) {
    unsigned int val;
NV_IF_ELSE_TARGET(NV_PROVIDES_SM_70,
    (asm volatile("ld.acquire.cta.u32 %0,[%1];" : "=r"(val) : _CG_ASM_PTR_CONSTRAINT(addr) : "memory");)
    ,
    (val = *((volatile unsigned int*) addr);
    __threadfence_block();)
    );
    return val;
}

It’s worth noting the following behavior:

  1. The application works fine only when I’m running 1 image stack and the result is correct; even on the one that was stuck.
  2. When the application runs multiple image stacks over the same pipeline, the issue doesn’t happen when I’m allocating large amount of dynamic shared memory than needed i.e. 16KB. The results per image stack was also correct.
  3. The issue persists if I try to change the dynamic shmem size after the 1st image stack.
  4. Compute sanitizer doesn’t report any memory check errors or possible race conditions that will affect this synchronization as it literally happens at the beginning of the kernel. It also gets stuck under the same runtime conditions

I always do the following after updating the kernel node parameters:

  1. cudaGraphExecKernelNodeSetParams
  2. cudaGraphLaunch

CUDA version:

Built on Wed_Jan_15_19:38:46_Pacific_Standard_Time_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0

GPU:

Device 0: NVIDIA RTX 2000 Ada Generation
Compute Capability:           8.9
CUDA Cores/SM (approx):       22 (SMs, cores vary by architecture)

Total Global Memory:          16379 MB
Shared Memory per Block:      48 KB
Shared Memory per SM:         100 KB
L2 Cache Size:                24576 KB
Registers per Block:          65536
Registers per SM:             65536

Is there a restriction where I can’t change the dynamic shared memory size once the graph finished execution? specially with a kernel that uses cooperative groups.

Does the kernel work without cuda graph?

Do all threads in the synchronizing group call group.sync() or are there some threads which do not?

If by cooperative group synchronization you mean grid-level synchronization, there is one issue that I can think of. For grid-level synchronization one has to make sure that all thread blocks fit on the SMs at the same time. Changing the dynamic shared memory size can alter the maximum number of resident threads blocks, which can lead to a dead lock if it is less than the number of launched blocks. You might have to update both the shared memory size and the number of blocks.

Does the kernel work without cuda graph?

Yes, and also with cuda graph if only the dynamic shared memory never updated or changed.

Do all threads in the synchronizing group call group.sync() or are there some threads which do not?

Not sure how to check that. I tried to disable the entire functionality inside the kernel and I attached the compute-sanitizer output in this reply.

If by cooperative group synchronization you mean grid-level synchronization

The kernel never does a grid-level synchronization, only using:

auto tile64 = cg::tiled_partition<64>(currentThreadBlock);
tile64.sync();

Changing the dynamic shared memory size can alter the maximum number of resident threads blocks, which can lead to a dead lock if it is less than the number of launched blocks

I do launch many blocks, but each block has only 2 warps for the current design. What is upper limit for the amount of blocks launched that comes from the relationship between the number of blocks to launch and the shmem size needed per block? The shmem size never exceeded even half the available per block according to cudaGetDeviceProperties

The number of blocks fits within the maximum grid size in all directions

----- Block & Grid Limits -----
Max Block Dimensions:         (1024, 1024, 64)
Max Grid Dimensions:          (2147483647, 65535, 65535)

Updated output from compute sanitizer running memcheck after disabling the entire functionality inside the kernel and only leaving the cooperative group partioining:

`========= Invalid ` **`shared`** ` atomic of size 4 bytes`
`=========     at cooperative_groups::__v1::__static_size_multi_warp_tile_base<(unsigned int)64>::__static_size_multi_warp_tile_base<cooperative_groups::__v1::thread_block>(const T1 &) [subobject]+0x17b0`
`=========     by thread (0,32,0) in block (62,0,0)`
`=========     Address 0x2a80 is out of bounds`
`=========         Device Frame: cooperative_groups::__v1::__multi_warp_thread_block_tile<(unsigned int)64, cooperative_groups::__v1::thread_block>::__multi_warp_thread_block_tile(const cooperative_groups::__v1::thread_block &) [subobject]+0x390`
`=========         Device Frame: cooperative_groups::__v1::details::thread_block_tile_impl<(unsigned int)64, cooperative_groups::__v1::thread_block, (bool)1>::thread_block_tile_impl<cooperative_groups::__v1::thread_block>(const T1 &) [subobject]+0x390`
`=========         Device Frame: cooperative_groups::__v1::thread_block_tile<(unsigned int)64, cooperative_groups::__v1::thread_block>::thread_block_tile(const cooperative_groups::__v1::thread_block &) [subobject]+0x390`
`=========         Device Frame: tiled_partition_impl+0x3510 in cooperative_groups.h:1597`
`=========         Device Frame: tiled_partition<64U,thread_block>+0x3520 in cooperative_groups.h:1616`
`=========         Device Frame: operator()+0x3590 in BlockXPixelPipeline.hpp:151`
`=========         Device Frame: TravHWC(unsigned int, const unsigned short *, const unsigned long long *, const unsigned long long *, const double *, const double *, const double *, int *, int *, unsigned char *, unsigned int, unsigned int, unsigned int, unsigned int, double, unsigned int, unsigned int, TravInfo)+0xfee0 in kernels.cu:146`
`=========     Saved host backtrace up to driver entry point at kernel launch time`
`=========         Host Frame: cudaGraphLaunch [0x491bd] in FittingDLL.dll`
`=========         Host Frame: H_Stream::operator() in H_Stream.cu:132 [0x2a2e2] in FittingDLL.dll`
`=========         Host Frame: H_GraphManager::operator() in H_GraphManager.cu:250 [0x2d02b] in FittingDLL.dll`
`=========         Host Frame: GPU_Execute in exportDLL.cu:150 [0x2f2a4] in FittingDLL.dll`
`=========         Host Frame:  [0x7ff7d57ef097] in`

The number of blocks is only relevant for grid sync. You do not need to worry about it if only thread block tiles are used.

Are you able to share the code which produces the invalid shared atomic?

Are you able to share the code which produces the invalid shared atomic?

Unfortunately, I can’t. However, the pseudo code flows as follow:

  1. Create and instantiate the graph
  2. Update kernel node parameters (including the shmem size)
  3. Launch the graph for execution
  4. The kernel is only dividing the amount of threads in the block into groups of 64 threads, then further grouping into 32 threads. (This works fine for the 1st image stack but fails for the next one)
  5. Once the application finished execution, update the shmem size again to be greater/lower than the previous size for the same graph. (This gave the output from compute-sanitizer)
  6. Failure in the cg::tiled_partition<64>(currentThreadBlock);

The kernel at the moment only have the partitioning into 64 tiles and then further partitioning into 2 warps, then synchronize for the 64 tiles.

My launch parameters:

  • block dimension (1, 64, 1) - hard configured for the moment and never changes
  • shmem size ranges between: 2 KB to 16 KB maximum.

I managed to make a version that I can share that can replicate the issue by changing the shmem size:

main.txt (2.6 KB)

and here is the code for quick scanning:

#include <assert.h>
#include <climits>
#include <stdio.h>
#include <vector>

#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
       fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);

       if (abort) exit(code);
   }
}

__global__ void squareArray(uint32_t s)
{
    extern __shared__ int shared_data[];
    cg::thread_block currentThreadBlock = cg::this_thread_block();
    auto tile64 = cg::tiled_partition<64>(currentThreadBlock);

    if (tile64.thread_rank() == 0) {
        shared_data[0] = s;
    }

    tile64.sync();

    auto warpTile32 = cg::tiled_partition<32>(currentThreadBlock);
    warpTile32.sync();
    tile64.sync();

}

int main(int argc, char **argv)
{
    cudaStream_t    stream;
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaGraphNode_t squareKernelNode;
    cudaKernelNodeParams kernelNodeParams = {0};
    uint32_t s = 123;
    void *squareKernelArgs[1] = {&s};

    gpuErrchk(cudaGraphCreate(&graph, 0));
    gpuErrchk(cudaStreamCreate(&stream));

    kernelNodeParams.func = (void *)squareArray;
    kernelNodeParams.gridDim = dim3(1024, 1, 1);
    kernelNodeParams.blockDim = dim3(128, 1, 1);
    kernelNodeParams.sharedMemBytes = 2048;
    kernelNodeParams.kernelParams = (void **)squareKernelArgs;

    gpuErrchk(cudaGraphAddKernelNode(&squareKernelNode, graph, nullptr, 0, &kernelNodeParams));
    gpuErrchk(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0));

    gpuErrchk(cudaGraphExecKernelNodeSetParams(graphExec, squareKernelNode, &kernelNodeParams));
    gpuErrchk(cudaGraphLaunch(graphExec, stream));
    gpuErrchk(cudaStreamSynchronize(stream));

    // Second launch (Updated shared memory)
    kernelNodeParams.gridDim = dim3(1024, 1, 1);
    kernelNodeParams.sharedMemBytes = 1024;     // FIX: REPLACE WITH 2048 - same as the previous graph launch
    gpuErrchk(cudaGraphExecKernelNodeSetParams(graphExec, squareKernelNode, &kernelNodeParams));
    gpuErrchk(cudaGraphLaunch(graphExec, stream));
    gpuErrchk(cudaStreamSynchronize(stream));

    gpuErrchk(cudaStreamDestroy(stream));
    gpuErrchk(cudaGraphExecDestroy(graphExec));

    printf("Graph executions completed.\n");
    return 0;
}

Compute sanitizer error:

========= Invalid shared atomic of size 4 bytes
=========     at cooperative_groups::__v1::details::red_and_release_cta(unsigned int *, unsigned int)+0x220 in sync.h:149
=========     by thread (96,0,0) in block (47,0,0)
=========     Address 0x800 is out of bounds
=========         Device Frame: cooperative_groups::__v1::details::sync_warps_reset(unsigned int *, unsigned int)+0x1c0 in sync.h:270
=========         Device Frame: cooperative_groups::__v1::__static_size_multi_warp_tile_base<(unsigned int)64>::__static_size_multi_warp_tile_base<cooperative_groups::__v1::thread_block>(const T1 &)+0x30 in cooperative_groups.h:1365
=========         Device Frame: cooperative_groups::__v1::__multi_warp_thread_block_tile<(unsigned int)64, cooperative_groups::__v1::thread_block>::__multi_warp_thread_block_tile(const cooperative_groups::__v1::thread_block &)+0x10 in cooperative_groups.h:1494
=========         Device Frame: cooperative_groups::__v1::details::thread_block_tile_impl<(unsigned int)64, cooperative_groups::__v1::thread_block, (bool)1>::thread_block_tile_impl<cooperative_groups::__v1::thread_block>(const T1 &)+0x10 in cooperative_groups.h:1545
=========         Device Frame: cooperative_groups::__v1::thread_block_tile<(unsigned int)64, cooperative_groups::__v1::thread_block>::thread_block_tile(const cooperative_groups::__v1::thread_block &)+0x10 in cooperative_groups.h:1565
=========         Device Frame: cooperative_groups::__v1::details::tiled_partition_impl<(unsigned int)64, cooperative_groups::__v1::thread_block>::tiled_partition_impl(const cooperative_groups::__v1::thread_block &)+0x10 in cooperative_groups.h:1597
=========         Device Frame: cooperative_groups::__v1::thread_block_tile<T1, T2> cooperative_groups::__v1::tiled_partition<(unsigned int)64, cooperative_groups::__v1::thread_block>(const T2 &)+0x10 in cooperative_groups.h:1616
=========         Device Frame: squareArray(unsigned int)+0x10 in main.cu:29
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: cudaGraphLaunch [0x4847d] in Fitting.exe
=========         Host Frame: main in main.cu:73 [0x352ec] in Fitting.exe

Graph executions completed.
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 4096 errors

I played around a bit with your code and believe that this is an issue with internal shared memory misconfiguration. Multi-warp tiles will make use of the 1kb reserved shared memor per block. for sm 80 and newer.

I modified your code to print some special registers which are configured for the driver shared memory, and added non-graph kernel launches at the end to compare graph shared memory configuration to traditional configuration.

#include <assert.h>
#include <climits>
#include <stdio.h>
#include <vector>

#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
       fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);

       if (abort) exit(code);
   }
}

__global__ void squareArray(uint32_t s)
{
    int reserved_smem_offset_begin;
    int reserved_smem_offset_end;
    int reserved_smem_offset_cap;
    int reserved_smem_offset_0;
    int reserved_smem_offset_1;
    unsigned int total_smem_size;
    unsigned int aggr_smem_size;

    asm("mov.b32 %0, %reserved_smem_offset_begin;"
        "mov.b32 %1, %reserved_smem_offset_end;"
        "mov.b32 %2, %reserved_smem_offset_cap;"
        "mov.b32 %3, %reserved_smem_offset_0;"
        "mov.b32 %4, %reserved_smem_offset_1;"
        "mov.u32 %5, %total_smem_size;" //size of user-declared shared memory
        //"mov.u32 %6, %aggr_smem_size;" // size of user-declared + reserved shared memory. only available for >= sm_90
        : 
        "=r"(reserved_smem_offset_begin),
        "=r"(reserved_smem_offset_end),
        "=r"(reserved_smem_offset_cap),
        "=r"(reserved_smem_offset_0),
        "=r"(reserved_smem_offset_1),
        "=r"(total_smem_size)
        //"=r"(aggr_smem_size)
        );
    if(threadIdx.x + blockIdx.x == 0){
        printf("reserved_smem_offset_begin %d\n" 
            "reserved_smem_offset_end %d\n"
            "reserved_smem_offset_cap %d\n"
            "reserved_smem_offset_0 %d\n"
            "reserved_smem_offset_1 %d\n"
            "total_smem_size %d\n"
            "aggr_smem_size %d\n", 
            reserved_smem_offset_begin,
            reserved_smem_offset_end,
            reserved_smem_offset_cap,
            reserved_smem_offset_0,
            reserved_smem_offset_1,
            total_smem_size,
            aggr_smem_size
        );
    }


    extern __shared__ int shared_data[];
    if(threadIdx.x + blockIdx.x == 0){
        printf("dynamic smem begins at %lu\n",  __cvta_generic_to_shared(shared_data));
    }
    // cg::thread_block currentThreadBlock = cg::this_thread_block();
    // auto tile64 = cg::tiled_partition<64>(currentThreadBlock);

    // if (tile64.thread_rank() == 0) {
    //     shared_data[0] = s;
    // }

    // tile64.sync();

    // auto warpTile32 = cg::tiled_partition<32>(currentThreadBlock);
    // warpTile32.sync();
    // tile64.sync();

}

int main(int argc, char **argv)
{
    cudaStream_t    stream;
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaGraphNode_t squareKernelNode;
    cudaKernelNodeParams kernelNodeParams = {0};
    uint32_t s = 123;
    void *squareKernelArgs[1] = {&s};

    gpuErrchk(cudaGraphCreate(&graph, 0));
    gpuErrchk(cudaStreamCreate(&stream));

    kernelNodeParams.func = (void *)squareArray;
    kernelNodeParams.gridDim = dim3(1024, 1, 1);
    kernelNodeParams.blockDim = dim3(128, 1, 1);
    kernelNodeParams.sharedMemBytes = 2048;
    kernelNodeParams.kernelParams = (void **)squareKernelArgs;

    gpuErrchk(cudaGraphAddKernelNode(&squareKernelNode, graph, nullptr, 0, &kernelNodeParams));
    gpuErrchk(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0));

    gpuErrchk(cudaGraphExecKernelNodeSetParams(graphExec, squareKernelNode, &kernelNodeParams));
    gpuErrchk(cudaGraphLaunch(graphExec, stream));
    gpuErrchk(cudaStreamSynchronize(stream));

    // Second launch (Updated shared memory)
    kernelNodeParams.gridDim = dim3(1024, 1, 1);
    kernelNodeParams.sharedMemBytes = 1024;     // FIX: REPLACE WITH 2048 - same as the previous graph launch
    gpuErrchk(cudaGraphExecKernelNodeSetParams(graphExec, squareKernelNode, &kernelNodeParams));
    gpuErrchk(cudaGraphLaunch(graphExec, stream));
    gpuErrchk(cudaStreamSynchronize(stream));

    gpuErrchk(cudaStreamDestroy(stream));
    gpuErrchk(cudaGraphExecDestroy(graphExec));

    printf("Graph executions completed.\n");


    squareArray<<<1024,128,2048>>>(s);
    squareArray<<<1024,128,1024>>>(s);

    cudaDeviceSynchronize();


    return 0;
}

With driver 580.95.05 and CUDA 13.0, I get the following output for sm_86 and sm_89, which produce the illegal memory access that you report.

reserved_smem_offset_begin 2048
reserved_smem_offset_end 2336
reserved_smem_offset_cap 3072
reserved_smem_offset_0 2048
reserved_smem_offset_1 2048
total_smem_size 2048
aggr_smem_size 0
dynamic smem begins at 0

reserved_smem_offset_begin 1024
reserved_smem_offset_end 2336
reserved_smem_offset_cap 2048
reserved_smem_offset_0 1024
reserved_smem_offset_1 2048
total_smem_size 1024
aggr_smem_size 0
dynamic smem begins at 0
Graph executions completed.

reserved_smem_offset_begin 2048
reserved_smem_offset_end 2336
reserved_smem_offset_cap 3072
reserved_smem_offset_0 2048
reserved_smem_offset_1 2048
total_smem_size 2048
aggr_smem_size 0
dynamic smem begins at 0

reserved_smem_offset_begin 1024
reserved_smem_offset_end 1312
reserved_smem_offset_cap 2048
reserved_smem_offset_0 1024
reserved_smem_offset_1 1024
total_smem_size 1024
aggr_smem_size 0
dynamic smem begins at 0

Note that reserved shared memory is located after dynamic shared memory, but more importantly, output does not match between the updated kernel in graph, and the standalone kernel with smem 1024.

On the otherhand, sm_120 will not produce illegal memory access, and its configuration output is consistent. (Also, reserved shared memory is located before dynamic shared memory.)

reserved_smem_offset_begin 0
reserved_smem_offset_end 288
reserved_smem_offset_cap 1024
reserved_smem_offset_0 64
reserved_smem_offset_1 0
total_smem_size 2048
aggr_smem_size 0
dynamic smem begins at 1024

reserved_smem_offset_begin 0
reserved_smem_offset_end 288
reserved_smem_offset_cap 1024
reserved_smem_offset_0 64
reserved_smem_offset_1 0
total_smem_size 1024
aggr_smem_size 0
dynamic smem begins at 1024

Graph executions completed.
reserved_smem_offset_begin 0
reserved_smem_offset_end 288
reserved_smem_offset_cap 1024
reserved_smem_offset_0 64
reserved_smem_offset_1 0
total_smem_size 2048
aggr_smem_size 0
dynamic smem begins at 1024

reserved_smem_offset_begin 0
reserved_smem_offset_end 288
reserved_smem_offset_cap 1024
reserved_smem_offset_0 64
reserved_smem_offset_1 0
total_smem_size 1024
aggr_smem_size 0
dynamic smem begins at 1024

I would suggesting filing a bug report , see How to report a bug

The following work-around might be possible.

For archs < sm_80, the shared memory for internal CG use must be given explicitly.

If that is done, and the code is compiled to Turing PTX, then to >= Ampere SASS, I do not observe an illegal memory access.

Specifically, I tried the following modification:

    constexpr int blocksize_of_kernel = 128;
    __shared__ cg::block_tile_memory<blocksize_of_kernel> smemForCGs;
    cg::thread_block currentThreadBlock = cg::this_thread_block(smemForCGs);
    auto tile64 = cg::tiled_partition<64>(currentThreadBlock);

And compiled for sm_86 using the command nvcc --generate-code=arch=compute_75,code=sm_86 main.cu -o main

Thanks for filing a ticket 6090062. We will bring back conclusion here when the ticket cycle is done internally.

6090062 is fixed. The fix will aim CUDA 13.4. Thanks for the great reproducers and investigation in this report which has been very helpful.

Best,
Yuki