Configuration of the shared memory and L1 cache?

Does anyone know how the shared memory / L1 size configuration is chosen?
For example, if:

  • my app uses 24 KB of shared memory per block (static + dynamic)
  • one block runs per SM

will the configuration be 32 KB for shared memory? Is there any way to measure or read the configuration dynamically?

I’m using A100 GPU.

Refer to the Shared Memory section in the CUDA C++ Programming Guide.

You can use the CUDA Runtime API cudaFuncGetAttributes to query the kernel's shared memory carveout preference (the preferredShmemCarveout field of cudaFuncAttributes); cudaFuncSetAttribute with cudaFuncAttributePreferredSharedMemoryCarveout lets you set it.
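A minimal sketch of such a query (the kernel here is a placeholder for your own); note that the carveout value is only a hint expressed as a percentage of the maximum, not the configuration the driver actually chose:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute your own.
__global__ void exampleKernel() {}

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, exampleKernel);
    // preferredShmemCarveout is a percentage of the maximum shared memory;
    // -1 (cudaSharedmemCarveoutDefault) means "no preference".
    printf("preferred carveout: %d\n", attr.preferredShmemCarveout);

    // The per-SM shared memory capacity is available as a device attribute
    // (164 KB on A100, of which up to 100 KB is usable per block).
    int smemPerSM = 0;
    cudaDeviceGetAttribute(&smemPerSM,
                           cudaDevAttrMaxSharedMemoryPerMultiprocessor, 0);
    printf("shared memory per SM: %d bytes\n", smemPerSM);
    return 0;
}
```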

You can find the “Shared Memory Configuration Size” under Details->Launch Statistics section of Nsight Compute.


Thanks for this answer.
I've read the carveout and found it to be -1, which means no preference.
I've used the Launch Statistics section to read the shared memory configuration size and found it is 100 KB.
The question is: why is 100 KB chosen when we only need 24 KB?
What is the rule for choosing the shared memory size, which in turn affects the L1 cache size and thus its performance?
I'm also asking whether there is any method or API to read the configured shared memory size (not from the profiler).

The heuristic is not documented. The runtime value can differ per TPC/SM.

Is the grid launch limited by occupancy to one thread block per SM, or is the code simply limiting the grid dimensions to the SM count? If the resource usage allows more than one thread block per SM, there is a high likelihood that the driver is increasing the shared memory size to allow multiple thread blocks per SM. In general, the driver (and compiler) do not adjust to grid dimensions.
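One way to check whether resource usage permits more than one block per SM is the occupancy API. A sketch, assuming (as an illustration only) a 256-thread block with 24 KB of dynamic shared memory; substitute your kernel's real launch configuration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute your own.
__global__ void exampleKernel() {}

int main() {
    int blocksPerSM = 0;
    // 256 threads per block and 24 KB of dynamic shared memory are
    // assumptions for illustration.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, exampleKernel, 256, 24 * 1024);
    printf("max resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```

If this reports more than one resident block, a larger carveout would be consistent with the driver's behavior described above (for instance, a 100 KB configuration can hold four 24 KB blocks, while 32 KB holds only one).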

Nsight Systems (nsys) has single-pass PM metrics that show the average shared memory allocated, but the configuration is not available at this time for A100. It is only available for graphics-focused cards (GA10x, AD10x, …).

This topic was automatically closed after 13 days. New replies are no longer allowed.