Refer to the Shared Memory section in the CUDA Programming Guide.
You can use the CUDA Runtime API cudaFuncGetAttributes to query the shared memory carveout: the preferredShmemCarveout field of the returned cudaFuncAttributes struct reports the preference (the same value that can be set with cudaFuncSetAttribute and cudaFuncAttributePreferredSharedMemoryCarveout).
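A minimal sketch of such a query (myKernel is a placeholder; substitute your own kernel):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel with some static shared memory; substitute your own.
__global__ void myKernel(float *out) {
    __shared__ float tile[256];
    tile[threadIdx.x % 256] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x % 256];
}

int main() {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, myKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // -1 (cudaSharedmemCarveoutDefault) means no preference was set.
    printf("preferredShmemCarveout: %d\n", attr.preferredShmemCarveout);
    printf("static shared memory:   %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}
```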
You can find the “Shared Memory Configuration Size” under the Details → Launch Statistics section of Nsight Compute.
Thanks for this answer.
I’ve read the carveout and found it to be -1, which means no preference.
I’ve used the Launch Statistics section to read the shared memory configuration size and found it to be 100 KB.
The question here is: why was 100 KB chosen although we only need 24 KB?
What is the rule for choosing the shared memory size, which in turn affects the L1 cache size and thus performance?
Also, is there any method or API to read the configured shared memory size (not from the profiler)?
The heuristic is not documented. The runtime value can be different per TPC/SM.
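While the configuration the runtime actually picked does not appear to be queryable directly, the hardware shared memory limits can be read without the profiler via cudaDeviceGetAttribute; a minimal sketch (device 0 assumed):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Hardware limits only; the carveout chosen at launch is not exposed here.
    int smemPerSm = 0, smemPerBlock = 0, smemPerBlockOptin = 0;
    cudaDeviceGetAttribute(&smemPerSm,
                           cudaDevAttrMaxSharedMemoryPerMultiprocessor, 0);
    cudaDeviceGetAttribute(&smemPerBlock,
                           cudaDevAttrMaxSharedMemoryPerBlock, 0);
    cudaDeviceGetAttribute(&smemPerBlockOptin,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, 0);
    printf("max shared memory per SM:             %d bytes\n", smemPerSm);
    printf("max shared memory per block:          %d bytes\n", smemPerBlock);
    printf("max shared memory per block (opt-in): %d bytes\n", smemPerBlockOptin);
    return 0;
}
```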
Is the grid launch limited by occupancy to 1 thread block per SM, or is the code simply limiting the grid dimensions to the SM count? If the resource usage allows more than one thread block per SM, there is a high likelihood that the driver is increasing the shared memory size to allow multiple thread blocks per SM. In general, the driver (and compiler) do not adjust to grid dimensions; see the occupancy sketch below.
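A minimal sketch of checking how many thread blocks of a kernel can be resident per SM with the occupancy API (the 24 KB static shared memory, the 256-thread block size, and zero dynamic shared memory are assumptions; substitute your own kernel and launch configuration):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel using 24 KB of static shared memory (6144 floats).
__global__ void myKernel(float *out) {
    __shared__ float tile[6144];
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}

int main() {
    const int blockSize = 256;          // assumed launch configuration
    const size_t dynamicSmemBytes = 0;  // assumed: no dynamic shared memory

    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, myKernel, blockSize, dynamicSmemBytes);

    int smCount = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);

    // If this reports > 1, the driver may raise the shared memory carveout
    // so that multiple blocks can be resident per SM.
    printf("resident blocks per SM: %d (device-wide: %d)\n",
           blocksPerSm, blocksPerSm * smCount);
    return 0;
}
```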
Nsight Systems (nsys) has single-pass PM metrics that will show the average shared memory allocated, but the shared memory configuration is not available at this time for A100. It is only available for graphics-focused cards (GA10x, AD10x, …).