I noticed in Nsight Compute that a kernel’s Shared Memory Configuration Size [Kbyte] is 65.54, while the sum of Static, Dynamic, and Driver Shared Memory is 5.24, as shown in the screenshot below.
How is the Shared Memory Configuration Size calculated?
Is the Block Limit Shared Mem [block] value determined based on the sum of Static, Dynamic, and Driver Shared Memory, or on the Shared Memory Configuration Size?
Starting with Volta (CC 7.0), the L1 cache and shared memory reside in the same physical space, and the split in resources between them is configurable; see here. In this instance it appears your card has a 64 KB shared memory configuration in use.
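As a minimal sketch of how that split can be influenced per kernel (the kernel name here is a placeholder, and the carveout is only a hint the driver may round to a supported configuration):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel used only to illustrate the attribute call.
__global__ void myKernel() { }

int main() {
    // Hint that the L1/shared split should favor shared memory for this
    // kernel. You can also pass an integer percentage (0-100) instead of
    // the cudaSharedmemCarveoutMaxShared constant.
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);
    myKernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

The runtime is free to choose a nearby supported carveout, which is why the configured size reported by Nsight Compute can be larger than what the kernel actually uses.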
It’s determined based on the amount you actually use.
I am a beginner in CUDA and would like to confirm whether my understanding is correct.
Based on the data shown in Nsight Compute, the kernel was given a 65.54 KB shared memory configuration at launch, while its actual usage per block is 4.22 KB + 1.02 KB = 5.24 KB. The GPU I am using is the 3090, with a maximum shared memory per SM of 102.4 KB. Therefore, the block limit due to shared memory would be floor(102.4 / 5.24) = 19. Is this understanding correct?
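For what it's worth, this kind of per-SM block limit can be cross-checked with the occupancy API. The sketch below assumes a kernel shaped like the one in the question (roughly 4.22 KB static plus 1.02 KB dynamic shared memory per block); the kernel name and sizes are illustrative, not from the original post:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel mirroring the numbers in the question:
// ~4.22 KB static shared memory plus dynamic shared memory sized at launch.
__global__ void myKernel(float *out) {
    __shared__ float staticSmem[1055];   // 1055 * 4 B = 4220 B ~ 4.22 KB
    extern __shared__ float dynSmem[];   // dynamic portion, sized at launch
    staticSmem[threadIdx.x % 1055] = threadIdx.x;
    dynSmem[threadIdx.x % 256] = staticSmem[threadIdx.x % 1055];
    out[threadIdx.x] = dynSmem[threadIdx.x % 256];
}

int main() {
    int blockSize = 256;
    size_t dynBytes = 1024;  // ~1.02 KB of dynamic shared memory per block

    // Ask the runtime how many blocks fit per SM given these resources.
    // Note: shared memory is only one of the limiters it considers
    // (registers, threads, and block count per SM also cap the result).
    int maxBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocks, myKernel,
                                                  blockSize, dynBytes);
    printf("Max active blocks per SM: %d\n", maxBlocks);
    return 0;
}
```

If shared memory is the binding limit, the printed value should match the Block Limit Shared Mem figure Nsight Compute reports.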
Thank you so much for your detailed response. I have one more question, if you don’t mind:
When running multiple kernels in parallel using GPU streams, and each kernel has a different shared memory configuration size, will the shared memory configuration default to the largest size, or will it be based on the first launched kernel?
I admit this is not an area I have experience in - I’m still on Pascal, so hopefully someone will correct me if needed.
The way I’d interpret this would be to sum the shared memory allocations across the kernels running in parallel and set the carveout appropriately if required. Don’t forget that allocations larger than 48 KB per block must be made dynamically.
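To illustrate that last point: going above the default 48 KB per-block limit requires both a dynamic allocation at launch and an explicit opt-in via a function attribute. A minimal sketch (the kernel name and the 64 KB size are assumptions for the example; the actual ceiling depends on your device):

```cuda
#include <cuda_runtime.h>

__global__ void bigSmemKernel() {
    // Dynamic shared memory: the size comes from the launch configuration,
    // not from the declaration.
    extern __shared__ float buf[];
    buf[threadIdx.x] = threadIdx.x;
}

int main() {
    // 64 KB per block, above the 48 KB default limit, so the kernel must
    // first opt in to the larger dynamic shared memory size.
    size_t bytes = 64 * 1024;
    cudaFuncSetAttribute(bigSmemKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         bytes);
    // Third launch parameter is the dynamic shared memory size per block.
    bigSmemKernel<<<1, 256, bytes>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Without the attribute call, a launch requesting more than 48 KB of dynamic shared memory fails with an invalid-argument error.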