I noticed in Nsight Compute that a kernel’s Shared Memory Configuration Size [Kbyte] is 65.54, while the sum of Static, Dynamic, and Driver Shared Memory is 5.24, as shown in the screenshot below.
How is the Shared Memory Configuration Size calculated?
Is the Block Limit Shared Mem [block] value determined based on the sum of Static, Dynamic, and Driver Shared Memory, or on the Shared Memory Configuration Size?
Starting with Volta (CC 7.0), the L1 cache and shared memory reside in the same physical space, and the split in resources between them is configurable; see here. In this instance it appears your card has a 64 KB shared memory configuration in use.
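As a minimal sketch of how that split can be influenced per kernel (the kernel name here is a placeholder, and the carveout is only a hint the driver may round to a supported configuration):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel used only to illustrate the attribute call.
__global__ void myKernel() { }

int main() {
    // Hint that the L1/shared split should favor shared memory for this
    // kernel. You can also pass an integer percentage (0-100) instead of
    // the cudaSharedmemCarveoutMaxShared constant.
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);
    myKernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

The runtime is free to choose a nearby supported carveout, which is why the configured size reported by Nsight Compute can be larger than what the kernel actually uses.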
It’s determined based on the amount you actually use.
I am a beginner in CUDA and would like to confirm whether my understanding is correct.
Based on the data shown in Nsight Compute, the kernel was given a 65.54 KB shared memory configuration at launch, while its actual usage per block is 4.22 KB + 1.02 KB = 5.24 KB. The GPU I am using is the 3090, with a maximum shared memory per SM of 102.4 KB. Therefore, the block limit due to shared memory would be floor(102.4 / 5.24) = 19. Is this understanding correct?
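For what it's worth, this kind of per-SM block limit can be cross-checked with the occupancy API. The sketch below assumes a kernel shaped like the one in the question (roughly 4.22 KB static plus 1.02 KB dynamic shared memory per block); the kernel name and sizes are illustrative, not from the original post:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel mirroring the numbers in the question:
// ~4.22 KB static shared memory plus dynamic shared memory sized at launch.
__global__ void myKernel(float *out) {
    __shared__ float staticSmem[1055];   // 1055 * 4 B = 4220 B ~ 4.22 KB
    extern __shared__ float dynSmem[];   // dynamic portion, sized at launch
    staticSmem[threadIdx.x % 1055] = threadIdx.x;
    dynSmem[threadIdx.x % 256] = staticSmem[threadIdx.x % 1055];
    out[threadIdx.x] = dynSmem[threadIdx.x % 256];
}

int main() {
    int blockSize = 256;
    size_t dynBytes = 1024;  // ~1.02 KB of dynamic shared memory per block

    // Ask the runtime how many blocks fit per SM given these resources.
    // Note: shared memory is only one of the limiters it considers
    // (registers, threads, and block count per SM also cap the result).
    int maxBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocks, myKernel,
                                                  blockSize, dynBytes);
    printf("Max active blocks per SM: %d\n", maxBlocks);
    return 0;
}
```

If shared memory is the binding limit, the printed value should match the Block Limit Shared Mem figure Nsight Compute reports.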
Thank you so much for your detailed response. I have one more question, if you don’t mind:
When running multiple kernels in parallel using GPU streams, and each kernel has a different shared memory configuration size, will the shared memory configuration default to the largest size, or will it be based on the first launched kernel?
I admit this is not an area I have experience in - I’m still on Pascal, so hopefully someone will correct me if needed.
The way I’d interpret this would be to sum the shared memory allocations across the kernels running in parallel and set the carveout appropriately if required. Don’t forget that allocations larger than 48 KB per block must be made dynamically.
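To illustrate that last point: going above the default 48 KB per-block limit requires both a dynamic allocation at launch and an explicit opt-in via a function attribute. A minimal sketch (the kernel name and the 64 KB size are assumptions for the example; the actual ceiling depends on your device):

```cuda
#include <cuda_runtime.h>

__global__ void bigSmemKernel() {
    // Dynamic shared memory: the size comes from the launch configuration,
    // not from the declaration.
    extern __shared__ float buf[];
    buf[threadIdx.x] = threadIdx.x;
}

int main() {
    // 64 KB per block, above the 48 KB default limit, so the kernel must
    // first opt in to the larger dynamic shared memory size.
    size_t bytes = 64 * 1024;
    cudaFuncSetAttribute(bigSmemKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         bytes);
    // Third launch parameter is the dynamic shared memory size per block.
    bigSmemKernel<<<1, 256, bytes>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Without the attribute call, a launch requesting more than 48 KB of dynamic shared memory fails with an invalid-argument error.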