I have a question regarding how the Shared Memory Configuration Size is calculated when I run a program on a GV100 GPU using Nsight Compute.
The definition of Shared Memory Configuration Size is:
Shared Memory Configuration Size indicates the shared memory size, in bytes, that is configured by the CUDA driver for this kernel launch, per block, taking into account all other configuration options and constraints set by the application, the CUDA driver or the HW. It is calculated by the driver and directly reported by the tool.
The GV100 uses a unified shared memory and L1 cache, and the configurable shared memory sizes are 0, 8, 16, 32, 64, 96 KB. The documentation (NVIDIA CUDA Library: cudaFuncSetCacheConfig) states that we can control the L1 cache and shared memory size using cudaFuncSetCacheConfig. The supported cache configurations are:
cudaFuncCachePreferNone: no preference for shared memory or L1 (default)
cudaFuncCachePreferShared: prefer larger shared memory and smaller L1 cache
cudaFuncCachePreferL1: prefer larger L1 cache and smaller shared memory
However, I don’t know what the default L1 and shared memory sizes are.
Also, if I select cudaFuncCachePreferShared, does the driver choose 64 KB shared memory + 32 KB L1? For a GPU like the RTX 4090, which has 128 KB of unified shared memory and L1 cache per SM, the configurable shared memory sizes are 0, 8, 16, 32, 64, 96, 128 KB. In that case, when cudaFuncCachePreferShared is selected, would it choose 96 KB shared + 32 KB L1? Is that correct?
I have already read the post: How is "Shared Memory Configuration Size" calculated?, but I still don’t understand. Does anyone know how the driver actually calculates this value?
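For reference, here is a minimal sketch of how I understand the preference can be set and then read back, so the result can be compared with what Nsight Compute reports. The kernel `dummyKernel`, the helper `carveoutPercent`, and the 64 KB target are hypothetical and only there for illustration. Both the cache config and the carveout attribute are hints, so the driver may still pick a different split. The rounding-down in the helper relies on my reading of the CUDA C++ Programming Guide that a percentage which does not map exactly to a supported capacity is rounded up to the next larger one; that is an assumption I have not verified on hardware.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel; it only exists so there is something to configure.
__global__ void dummyKernel(float *out) {
    extern __shared__ float tile[];
    tile[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

// Hypothetical helper: express a wanted shared memory size as an integer
// percentage of the per-SM maximum. Rounds down so the hint does not
// overshoot the wanted size (assumes the driver rounds up to the next
// supported capacity).
int carveoutPercent(size_t wantedBytes, size_t maxBytesPerSM) {
    if (maxBytesPerSM == 0) return 0;
    if (wantedBytes >= maxBytesPerSM) return 100;
    return static_cast<int>(wantedBytes * 100 / maxBytesPerSM);
}

int main() {
    int dev = 0, maxPerSM = 0;
    CHECK(cudaGetDevice(&dev));
    CHECK(cudaDeviceGetAttribute(&maxPerSM,
                                 cudaDevAttrMaxSharedMemoryPerMultiprocessor,
                                 dev));

    // Coarse hint: prefer shared memory over L1.
    CHECK(cudaFuncSetCacheConfig(dummyKernel, cudaFuncCachePreferShared));

    // Finer hint: ask for roughly 64 KB of shared memory per SM.
    int pct = carveoutPercent(64 * 1024, static_cast<size_t>(maxPerSM));
    CHECK(cudaFuncSetAttribute(dummyKernel,
                               cudaFuncAttributePreferredSharedMemoryCarveout,
                               pct));

    // Read back what the runtime has recorded for this kernel.
    cudaFuncAttributes attr;
    CHECK(cudaFuncGetAttributes(&attr, dummyKernel));
    printf("max shared memory per SM:   %d bytes\n", maxPerSM);
    printf("preferred carveout:         %d %%\n", attr.preferredShmemCarveout);
    printf("static shared memory:       %zu bytes\n", attr.sharedSizeBytes);
    printf("max dynamic shared memory:  %d bytes\n",
           attr.maxDynamicSharedSizeBytes);

    // Launch once so the kernel shows up in an Nsight Compute profile.
    float *out = nullptr;
    CHECK(cudaMalloc(&out, 32 * sizeof(float)));
    dummyKernel<<<1, 32, 32 * sizeof(float)>>>(out);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(out));
    return 0;
}
```

Profiling this binary with Nsight Compute and changing the hint between runs should show whether Shared Memory Configuration Size follows the requested carveout or something else.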
