How does the driver calculate the "Shared Memory Configuration Size" value?

I have a question regarding how the Shared Memory Configuration Size is calculated when I run a program on a GV100 GPU using Nsight Compute.

The definition of Shared Memory Configuration Size is:

Shared Memory Configuration Size indicates the shared memory size, in bytes, that is configured by the CUDA driver for this kernel launch, per block, taking into account all other configuration options and constraints set by the application, the CUDA driver or the HW. It is calculated by the driver and directly reported by the tool.

The GV100 uses a unified shared memory and L1 cache, and the configurable shared memory sizes are 0, 8, 16, 32, 64, 96 KB. The documentation (NVIDIA CUDA Library: cudaFuncSetCacheConfig) states that we can control the L1 cache and shared memory size using cudaFuncSetCacheConfig. The supported cache configurations are:

  • cudaFuncCachePreferNone: no preference for shared memory or L1 (default)
  • cudaFuncCachePreferShared: prefer larger shared memory and smaller L1 cache
  • cudaFuncCachePreferL1: prefer larger L1 cache and smaller shared memory

However, I don’t know what the default L1 and shared memory sizes are.

Also, if I select cudaFuncCachePreferShared, does the driver choose 64 KB shared memory + 32 KB L1? For a GPU like the RTX 4090, which has 128 KB of unified shared memory and L1 cache per SM, the configurable shared memory sizes are 0, 8, 16, 32, 64, 96, 128 KB. In that case, when cudaFuncCachePreferShared is selected, would it choose 96 KB shared + 32 KB L1? Is that correct?

I have already read the post: How is "Shared Memory Configuration Size" calculated?, but I still don’t understand. Does anyone know how the driver actually calculates this value?

I am not sure if the calculation done by the driver is public or not. I am moving this to a more appropriate forum section, since the question is effectively to the CUDA driver, not Nsight Compute.

Thanks for the move!

Those settings are old (probably from compute capability 2.0, when there were two settings, or 3.0, when there were three). They are only needed for dynamic shared memory (i.e. shared memory whose size is chosen at kernel launch as a launch parameter).

A more modern setting is cudaFuncAttributePreferredSharedMemoryCarveout, which is set through cudaFuncSetAttribute (see CUDA Runtime API :: CUDA Toolkit Documentation). It accepts a percentage as the function parameter. The available sizes are of course as stated by you.
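To make the percentage idea concrete, here is a minimal sketch of how a carveout percentage could map onto the supported GV100 sizes. It assumes the preference is a percentage of the largest supported capacity and is rounded up to the next supported size; the function name carveout_to_config_kb is made up for illustration, and the driver may still choose something else.

```python
# Supported shared memory capacities per SM on GV100, in KB
# (the list quoted earlier in this thread).
SM70_SMEM_CONFIGS_KB = [0, 8, 16, 32, 64, 96]


def carveout_to_config_kb(percent, configs_kb=SM70_SMEM_CONFIGS_KB):
    """Map a carveout percentage (0-100) to a supported capacity in KB.

    Assumption: the percentage refers to the largest supported capacity
    and is rounded up to the next supported size. This is only a hint;
    the driver has the last word.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    requested_kb = max(configs_kb) * percent / 100.0
    for size_kb in sorted(configs_kb):
        if size_kb >= requested_kb:
            return size_kb
    return max(configs_kb)


if __name__ == "__main__":
    for pct in (0, 25, 50, 100):
        print(pct, "% ->", carveout_to_config_kb(pct), "KB")
```

Under these assumptions, 50% of 96 KB is 48 KB, which is not a supported size, so the sketch returns 64 KB.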

I would assume that (for compatibility reasons) the old function relates to the 16 KiB / 32 KiB / 48 KiB of 64 KiB, and gives the remainder to L1 cache.

The driver has the last word. So best try it out. Or focus on static shared memory, e.g. by putting a byte buffer there.

Thanks, that makes sense. I’ll try the modern carveout API and run some minor experiments to see the actual shared memory configuration.

Also please consider combining with cudaFuncAttributeMaxDynamicSharedMemorySize

Thank you for your reminder. I checked the CUDA C++ Programming Guide (Legacy):

“the driver automatically configures the shared memory capacity for each kernel to avoid shared memory occupancy bottlenecks while also allowing concurrent execution with already launched kernels where possible. In most cases, the driver’s default behavior should provide optimal performance.”

So I ran some simulations. Here is an example:

The program runs on a GV100. The resources used per thread block (except driver shared memory) can be read from the CUDA program. Since driver shared memory is small, I assume it to be 0:

  • Static shared memory: 12.29 KB
  • Dynamic shared memory: 0.0 KB
  • Driver shared memory: 0.0 KB
  • Registers per thread: 255
  • Threads per block: 64

Based on this information, I simulated the theoretical occupancy under different amounts of allocated shared memory per SM (GV100 has a total unified L1/shared memory of 128 KB, and the configurable shared memory sizes are 0, 8, 16, 32, 64, or 96 KB). The simulation result:

======================================================================
SharedMemoryConfigSize simulator verification program
======================================================================

Simulation test (1 case):
----------------------------------------------------------------------
[*] Test case: GV100 GEMM (12.29KB, 255 regs, 64 threads)
    Static: 12.29 KB
    Dynamic: 0.0 KB
    Driver: 0.0 KB
    Compute capability: 70
    Registers per thread: 255
    Block size: 64

[*] GPU info:
    Name: sm_70
    Compute Capability: 7.0
    Shared memory config options: [0, 8, 16, 32, 64, 96] KB
    Max resident blocks per SM: 32
    Max resident warps per SM: 64
    Max registers per SM: 65536

[*] SMEM to CTAS occupancy:
    Config size: 0 KB: Max resident CTAs: 0, Theoretical occupancy: 0.00%
    Config size: 8 KB: Max resident CTAs: 0, Theoretical occupancy: 0.00%
    Config size: 16 KB: Max resident CTAs: 1, Theoretical occupancy: 3.12%
    Config size: 32 KB: Max resident CTAs: 2, Theoretical occupancy: 6.25%
    Config size: 64 KB: Max resident CTAs: 4, Theoretical occupancy: 12.50%
    Config size: 96 KB: Max resident CTAs: 4, Theoretical occupancy: 12.50%

As shown, when the shared memory config size is 64 KB, the theoretical occupancy reaches 12.5%, which is the limit set by register usage. If the config size is increased further, the theoretical occupancy does not increase; instead, the smaller L1 cache may degrade performance. I then verified this in Nsight Compute, and the driver had indeed configured 64 KB of shared memory per SM. I also checked this simulation method against some other programs.
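For anyone who wants to reproduce the table, here is a minimal sketch of the kind of calculation involved. It is not the original simulator and not the driver's algorithm. It assumes registers are allocated per warp in multiples of 256 (the reg_alloc_unit parameter is an assumption), ignores shared memory allocation granularity, and takes driver shared memory as 0, as in the post above. All function names are made up.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def max_resident_ctas(config_kb, smem_per_block_kb, regs_per_thread,
                      threads_per_block, max_blocks_per_sm=32,
                      max_warps_per_sm=64, max_regs_per_sm=65536,
                      warp_size=32, reg_alloc_unit=256):
    """Resident blocks per SM for one shared memory configuration size."""
    warps_per_block = ceil_div(threads_per_block, warp_size)
    # Assumption: registers are allocated per warp, rounded up to a
    # multiple of reg_alloc_unit.
    regs_per_warp = ceil_div(regs_per_thread * warp_size,
                             reg_alloc_unit) * reg_alloc_unit
    by_regs = max_regs_per_sm // (regs_per_warp * warps_per_block)
    by_warps = max_warps_per_sm // warps_per_block
    if smem_per_block_kb > 0:
        by_smem = int(config_kb // smem_per_block_kb)
    else:
        by_smem = max_blocks_per_sm  # shared memory is not a limiter
    return min(by_smem, by_regs, by_warps, max_blocks_per_sm)


def theoretical_occupancy(ctas, threads_per_block,
                          max_warps_per_sm=64, warp_size=32):
    """Fraction of the SM's warp slots filled by `ctas` resident blocks."""
    return ctas * ceil_div(threads_per_block, warp_size) / max_warps_per_sm


def smallest_config_with_max_occupancy(configs_kb, **kernel):
    """Smallest config size that already reaches the best CTA count."""
    ctas = {cfg: max_resident_ctas(cfg, **kernel) for cfg in configs_kb}
    best = max(ctas.values())
    return min(cfg for cfg, n in ctas.items() if n == best)


if __name__ == "__main__":
    configs = [0, 8, 16, 32, 64, 96]
    kernel = dict(smem_per_block_kb=12.29, regs_per_thread=255,
                  threads_per_block=64)
    for cfg in configs:
        n = max_resident_ctas(cfg, **kernel)
        print(cfg, "KB:", n, "CTAs,", theoretical_occupancy(n, 64))
    print("chosen:", smallest_config_with_max_occupancy(configs, **kernel))
```

With the GEMM numbers from the test case, the register limit is 4 blocks per SM, so the sketch picks 64 KB as the smallest size that reaches that limit.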

I hope this can help others.

As the quote from the Programming Guide indicates, driver behaviour is dynamic and takes concurrent execution with other kernels into account. It has to, since the smem/L1 split is SM-wide and affects all threadblocks/kernels running on an SM.

So if the GPU is idle a kernel A that uses 2KiB shared memory per threadblock might launch with a carveout of 8 or 16KiB. But if there’s already a kernel B running which uses 34KiB per threadblock then the carveout might already be at 96KiB, with two threadblocks of B and 28KiB unused. That’s more than enough to launch several threadblocks of A, as long as occupancy isn’t limited by other resources.
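The arithmetic in that example can be written out as a small sketch. It only counts shared memory, ignores any per-block allocation granularity, and the function name is made up for illustration.

```python
def blocks_fitting_in_leftover(carveout_kib, resident, newcomer_kib):
    """Return (leftover KiB, number of newcomer blocks that fit).

    resident: list of (block_count, kib_per_block) already on the SM.
    Only shared memory is considered; other resources may limit
    occupancy earlier.
    """
    used = sum(count * kib for count, kib in resident)
    leftover = carveout_kib - used
    if leftover < 0:
        raise ValueError("resident blocks exceed the carveout")
    if newcomer_kib == 0:
        return leftover, None  # shared memory does not limit the newcomer
    return leftover, leftover // newcomer_kib


if __name__ == "__main__":
    # Two blocks of kernel B at 34 KiB each, carveout at 96 KiB,
    # kernel A needs 2 KiB per block.
    print(blocks_fitting_in_leftover(96, [(2, 34)], 2))
```

For the numbers above, 96 - 2 * 34 = 28 KiB is left, which is room for 14 blocks of A as far as shared memory is concerned.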

I don’t know if SMs can change their smem/L1 configuration while having active threadblocks. I suspect not, or possibly only in the direction of growing smem and shrinking L1.

Thanks, I agree with your analysis. When there are already active blocks (from kernel B) resident on an SM, the driver likely cannot change the L1/smem configuration. If it did, blocks from kernel B might be unable to access the correct data in shared memory.

I’ve only considered simple single-kernel execution scenarios so far, and I don’t have ideas about the complexity of concurrency yet.

You inspired me in one way: I realized there is another constraint that needs to be taken into account, namely the maximum number of blocks that each SM actually has to execute. For example, if the total number of blocks is so small that each SM needs to execute at most 3 blocks, then allocating more shared memory than is needed for 3 blocks is pointless, although it would increase theoretical occupancy. This upper limit should also be considered.
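A minimal sketch of that extra constraint, reusing the CTA counts from the table above. It assumes blocks are spread evenly over the SMs and uses 80 SMs for GV100 as an assumption; the function name is made up, and this is my guess at a heuristic, not confirmed driver behaviour.

```python
def smallest_useful_config_kb(ctas_by_config, total_blocks, num_sms):
    """Smallest config size that covers the blocks each SM really gets.

    ctas_by_config: {config size in KB: max resident CTAs per SM}.
    Assumption: blocks are distributed evenly, so each SM needs at most
    ceil(total_blocks / num_sms) resident blocks.
    """
    blocks_needed = -(-total_blocks // num_sms)  # ceiling division
    target = min(max(ctas_by_config.values()), blocks_needed)
    return min(cfg for cfg, ctas in ctas_by_config.items() if ctas >= target)


# CTA counts per config size, copied from the simulation output above.
GV100_TABLE = {0: 0, 8: 0, 16: 1, 32: 2, 64: 4, 96: 4}

if __name__ == "__main__":
    # 80 SMs is assumed here for GV100.
    for grid in (80, 160, 240, 10000):
        print(grid, "blocks ->", smallest_useful_config_kb(GV100_TABLE, grid, 80), "KB")
```

With 160 blocks on 80 SMs, each SM needs only 2 resident blocks, so 32 KB would already be enough in this model even though 64 KB gives higher theoretical occupancy.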

On CC < GH100, the TPC (2 SMs) has to be idle before the L1/SHMEM configuration can be changed.
On CC >= GH100, the SM has to be idle before the L1/SHMEM configuration can be changed.

There is a challenge: when grids are launched on different streams without synchronization, the unpredictable distribution order can lead to very different execution behavior, depending on the L1/SHMEM configuration that is active at runtime. For example, you can have two kernels A and B, where A requires shared memory and B requires none. If A is distributed before B, then A and B may be able to run concurrently. If B is distributed before A, then B and A may not be able to run concurrently, because A cannot run until the L1/SHMEM configuration has been changed.
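The ordering effect can be illustrated with a toy model of a single SM. It is heavily simplified and rests on assumptions: the carveout is set for the first grid while the SM is idle, to the smallest supported size that holds one of its blocks, and it cannot change until the SM is idle again. Real hardware and driver behaviour is more involved, and the function name is made up.

```python
def can_overlap(first_smem_kb, second_smem_kb,
                configs_kb=(0, 8, 16, 32, 64, 96)):
    """Toy model: can one block of the second grid join the first on an SM?

    Assumptions: the carveout is the smallest supported size that holds
    one block of the first grid, and it stays fixed while that block is
    resident.
    """
    carveout = min(c for c in configs_kb if c >= first_smem_kb)
    leftover = carveout - first_smem_kb
    return second_smem_kb <= leftover


if __name__ == "__main__":
    a_kb, b_kb = 4, 0  # A needs shared memory, B needs none
    print("A then B:", can_overlap(a_kb, b_kb))
    print("B then A:", can_overlap(b_kb, a_kb))
```

In this model, A followed by B can overlap, while B followed by A cannot, because the carveout chosen for B leaves no shared memory for A.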

That last bit is an interesting effect and can explain some strange behaviors of concurrent kernels.