I’m running an application that processes image stacks of different shapes (H, W, D). I create a CUDA graph only once per run for all of these inputs, using cudaGraphCreate and cudaGraphInstantiate, where I add the graph nodes and set the kernel parameters.
The issue appears when I try to update the dynamic shared memory size in cudaKernelNodeParams. Inside the kernel that uses the dynamic shared memory, the cooperative-groups synchronization gets stuck, specifically at this call stack:
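For context, here is a minimal sketch of the one-time graph setup (simplified; the kernel name, dimensions, and variable names below are placeholders, not my actual code):

```cpp
// One-time graph setup (sketch; myKernel, dIn/dOut, dims are illustrative).
cudaGraph_t     graph;
cudaGraphExec_t graphExec;
cudaGraphNode_t kernelNode;

cudaGraphCreate(&graph, 0);

void* kernelArgs[] = { &dIn, &dOut, &h, &w, &d };

cudaKernelNodeParams nodeParams = {};
nodeParams.func           = (void*)myKernel;
nodeParams.gridDim        = dim3(gridX, gridY, gridZ);
nodeParams.blockDim       = dim3(blockX, blockY, blockZ);
nodeParams.sharedMemBytes = dynSmemBytes;   // dynamic shared memory for this stack
nodeParams.kernelParams   = kernelArgs;
nodeParams.extra          = nullptr;

cudaGraphAddKernelNode(&kernelNode, graph, /*pDependencies=*/nullptr,
                       /*numDependencies=*/0, &nodeParams);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
```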
[CUDA]kernels.obj !ld_acquire_cta Line 160 [0x0000001300cd48e0]
[CUDA]kernels.obj !barrier_wait Line 177 [0x0000001300cd4920]
[CUDA]kernels.obj !sync_warps Line 195 [0x0000001300cd4980]
[CUDA]kernels.obj !sync Line 1458 [0x0000001300cd4a10]
[CUDA]kernels.obj !operator() Line 339 [0x0000001300cd4a40]
From within the cooperative groups header sync.h:
// Read the barrier, acquire to ensure all memory operations following the sync
// are correctly performed after it is released
_CG_STATIC_QUALIFIER unsigned int ld_acquire_cta(unsigned int *addr) {
    unsigned int val;
    NV_IF_ELSE_TARGET(NV_PROVIDES_SM_70,
        (asm volatile("ld.acquire.cta.u32 %0,[%1];" : "=r"(val) : _CG_ASM_PTR_CONSTRAINT(addr) : "memory");)
        ,
        (val = *((volatile unsigned int*) addr);
         __threadfence_block();)
    );
    return val;
}
It’s worth noting the following behavior:
- The application works fine when I run only one image stack, and the result is correct, even for the stack that otherwise gets stuck.
- When the application runs multiple image stacks through the same pipeline, the issue doesn’t happen if I allocate a larger amount of dynamic shared memory than needed for every stack, e.g. a fixed 16 KB. The results per image stack were also correct in that case.
- The issue persists if I try to change the dynamic shared memory size after the first image stack.
- Compute Sanitizer doesn’t report any memcheck errors or race conditions that could affect this synchronization, which happens right at the beginning of the kernel. The application also gets stuck under the same conditions when run under Compute Sanitizer.
I always do the following after updating the kernel node parameters:
- cudaGraphExecKernelNodeSetParams
- cudaGraphLaunch
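Concretely, the per-stack update looks roughly like this (simplified sketch; variable names are illustrative):

```cpp
// Per-stack update of the instantiated graph (sketch).
cudaKernelNodeParams nodeParams = {};
// ... func, gridDim, blockDim, kernelParams filled in as at creation time ...
nodeParams.sharedMemBytes = newDynSmemBytes;  // new size for this stack's (H, W, D)

cudaGraphExecKernelNodeSetParams(graphExec, kernelNode, &nodeParams);
cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
```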
CUDA version:
Built on Wed_Jan_15_19:38:46_Pacific_Standard_Time_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0
GPU:
Device 0: NVIDIA RTX 2000 Ada Generation
Compute Capability: 8.9
Multiprocessors (SMs): 22
Total Global Memory: 16379 MB
Shared Memory per Block: 48 KB
Shared Memory per SM: 100 KB
L2 Cache Size: 24576 KB
Registers per Block: 65536
Registers per SM: 65536
Is there a restriction that prevents changing the dynamic shared memory size in an instantiated graph once it has finished executing, especially for a kernel that uses cooperative groups?