How is the Shared Memory Configuration Size calculated?

I noticed in Nsight Compute that a kernel’s Shared Memory Configuration Size [Kbyte] is 65.54, while the sum of Static, Dynamic, and Driver Shared Memory is 5.24, as shown in the screenshot below.

  1. How is the Shared Memory Configuration Size calculated?
  2. Is the Block Limit Shared Mem [block] value determined based on the sum of Static, Dynamic, and Driver Shared Memory, or on the Shared Memory Configuration Size?

Starting with Volta (CC 7.0), the L1 cache and shared memory reside in the same on-chip space, and the split in resources between them is configurable; see here. In this instance, it appears that 64 KB of that space is configured as shared memory.
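For reference, here is a minimal sketch of how the split can be requested from host code. The kernel `myKernel` is a hypothetical placeholder; the carveout attribute is only a hint, and the driver may pick a different configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to have something to attach the attribute to.
__global__ void myKernel(float *out) {
    __shared__ float tile[256];
    tile[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[255 - threadIdx.x];
}

int main() {
    // Query the maximum shared memory per SM on device 0.
    int maxSmemPerSM = 0;
    cudaDeviceGetAttribute(&maxSmemPerSM,
                           cudaDevAttrMaxSharedMemoryPerMultiprocessor, 0);
    printf("Max shared memory per SM: %d bytes\n", maxSmemPerSM);

    // Ask for the largest shared memory carveout for this kernel.
    // This is a preference, not a guarantee.
    cudaError_t err = cudaFuncSetAttribute(
        myKernel, cudaFuncAttributePreferredSharedMemoryCarveout,
        cudaSharedmemCarveoutMaxShared);
    if (err != cudaSuccess) {
        printf("cudaFuncSetAttribute failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float *d_out = nullptr;
    cudaMalloc(&d_out, 256 * sizeof(float));
    myKernel<<<1, 256>>>(d_out);
    err = cudaDeviceSynchronize();
    printf("Kernel status: %s\n", cudaGetErrorString(err));
    cudaFree(d_out);
    return 0;
}
```

After a change like this, the Shared Memory Configuration Size reported by the profiler should show whichever configuration the driver actually selected.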

The Block Limit Shared Mem value is determined by the amount each block actually uses.

Thanks for your reply.

I am a beginner in CUDA and would like to confirm whether my understanding is correct.

Based on the data shown in Nsight Compute, the kernel allocated 65.54 KB of shared memory upon launch. According to the actual usage, it should be 4.22 KB + 1.02 KB = 5.24 KB per block. The GPU I am using is an RTX 3090, with a maximum shared memory per SM of 102.4 KB. Therefore, the block limit due to shared memory would be 102.4 / 5.24 ≈ 19.5, rounded down to 19. Is this understanding correct?

Not quite. It will allocate 5.24 KB per block out of a configured limit of 65.54 KB. If you raised the limit to 102.4 KB (the maximum), the block limit would be 19.

Your kernel is block limited to 6 blocks due to register and warp limits.
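To make the arithmetic concrete, here is a rough sketch using the figures quoted in this thread. It is host-only code that compiles as plain C++ or inside a .cu file. The helper names `smemBlockLimit` and `minInt` are made up for illustration, and the calculation assumes a simple floor division; the hardware's allocation granularity is ignored.

```cpp
// Blocks per SM that fit into the shared memory budget.
// Assumption: plain floor division, no allocation-granularity rounding.
constexpr int smemBlockLimit(double configuredKB, double perBlockKB) {
    return static_cast<int>(configuredKB / perBlockKB);
}

// The effective block limit is the smallest of the individual limits.
constexpr int minInt(int a, int b) { return a < b ? a : b; }

// Per-block usage from the thread: 4.22 KB + 1.02 KB = 5.24 KB.
constexpr double kPerBlockKB = 5.24;

// With the currently configured 65.54 KB, the same arithmetic gives 12.
static_assert(smemBlockLimit(65.54, kPerBlockKB) == 12, "configured size");

// With the 102.4 KB maximum, it gives 19.
static_assert(smemBlockLimit(102.4, kPerBlockKB) == 19, "maximum size");

// Register and warp limits (6 blocks, as reported above) still dominate.
static_assert(minInt(smemBlockLimit(102.4, kPerBlockKB), 6) == 6, "overall");
```

Either way, raising the shared memory configuration would not change occupancy for this kernel, because the limit of 6 from registers and warps is the smaller one.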

Thank you so much for your detailed response. I have one more question, if you don’t mind:

When running multiple kernels in parallel using GPU streams, and each kernel has a different shared memory configuration size, will the shared memory configuration default to the largest size, or will it be based on the first launched kernel?

I admit this is not an area I have experience in - I’m still on Pascal, so hopefully someone will correct me if needed.

The way I’d interpret this would be to sum the shared memory allocations across the kernels running in parallel and set the carveout accordingly, if required. Don’t forget that allocations larger than 48 KB per block must use dynamic shared memory.
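As an illustration of the 48 KB point, here is a minimal sketch of a launch with 64 KB of dynamic shared memory per block. The kernel `bigSmemKernel` and the sizes are hypothetical. It assumes a device whose per-block opt-in maximum is at least 64 KB, and the opt-in via `cudaFuncAttributeMaxDynamicSharedMemorySize` is needed before the launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that uses a dynamically sized shared buffer.
__global__ void bigSmemKernel(float *out, int n) {
    extern __shared__ float buf[];  // size is set at launch time
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        buf[i] = static_cast<float>(i);
    }
    __syncthreads();
    if (threadIdx.x == 0) {
        out[blockIdx.x] = buf[n - 1];
    }
}

int main() {
    const int n = 16 * 1024;                     // 16384 floats
    const size_t smemBytes = n * sizeof(float);  // 64 KB, above the 48 KB default

    // Opt in to more than 48 KB of dynamic shared memory for this kernel.
    cudaError_t err = cudaFuncSetAttribute(
        bigSmemKernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
        static_cast<int>(smemBytes));
    if (err != cudaSuccess) {
        printf("Opt-in failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float));

    // The third launch parameter is the dynamic shared memory size in bytes.
    bigSmemKernel<<<1, 256, smemBytes>>>(d_out, n);
    err = cudaDeviceSynchronize();
    printf("Kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```

Without the opt-in call, a launch that requests more than 48 KB of dynamic shared memory is expected to fail with an error.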

Thank you for your explanation! It cleared up some of my doubts and was very helpful. Much appreciated!