Configuration of the shared memory and L1 cache?

Does anyone know how the shared memory / L1 size configuration is chosen?
For example, if:

  • my app uses 24 KB of shared memory per block (static + dynamic)
  • one block runs per SM

will the configuration be 32 KB for shared memory? Is there any way to measure or read the configuration dynamically?

I’m using A100 GPU.

Refer to the Shared Memory section in the CUDA C++ Programming Guide.

You can use the CUDA Runtime API cudaFuncGetAttributes to query the kernel's shared memory carveout preference (the preferredShmemCarveout field of cudaFuncAttributes); cudaFuncSetAttribute with cudaFuncAttributePreferredSharedMemoryCarveout lets you set it.
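A minimal sketch of such a query (the kernel here is a placeholder for your own); note that the carveout value is only a hint expressed as a percentage of the maximum, not the configuration the driver actually chose:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute your own.
__global__ void exampleKernel() {}

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, exampleKernel);
    // preferredShmemCarveout is a percentage of the maximum shared memory;
    // -1 (cudaSharedmemCarveoutDefault) means "no preference".
    printf("preferred carveout: %d\n", attr.preferredShmemCarveout);

    // The per-SM shared memory capacity is available as a device attribute
    // (164 KB on A100, of which up to 100 KB is usable per block).
    int smemPerSM = 0;
    cudaDeviceGetAttribute(&smemPerSM,
                           cudaDevAttrMaxSharedMemoryPerMultiprocessor, 0);
    printf("shared memory per SM: %d bytes\n", smemPerSM);
    return 0;
}
```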

You can find the “Shared Memory Configuration Size” under Details->Launch Statistics section of Nsight Compute.


Thanks for this answer.
I've read the carveout and found it to be -1, which means no preference.
I've used the Launch Statistics section to read the shared memory configuration size and found it is 100 KB.
The question is: why is 100 KB chosen when we only need 24 KB?
What is the rule for choosing the shared memory size, which in turn affects the L1 cache size and thus its performance?
I'm also asking whether there is any method or API to read the configured shared memory size (not from the profiler).

The heuristic is not documented. The runtime value can differ per TPC/SM.

Is the grid launch limited by occupancy to one thread block per SM, or is the code simply limiting the grid dimensions to the SM count? If the resource usage allows more than one thread block per SM, there is a high likelihood that the driver is increasing the shared memory size to allow multiple thread blocks per SM. In general, the driver (and compiler) do not adjust to grid dimensions.
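One way to check whether resource usage permits more than one block per SM is the occupancy API. A sketch, assuming (as an illustration only) a 256-thread block with 24 KB of dynamic shared memory; substitute your kernel's real launch configuration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute your own.
__global__ void exampleKernel() {}

int main() {
    int blocksPerSM = 0;
    // 256 threads per block and 24 KB of dynamic shared memory are
    // assumptions for illustration.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, exampleKernel, 256, 24 * 1024);
    printf("max resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```

If this reports more than one resident block, a larger carveout would be consistent with the driver's behavior described above (for instance, a 100 KB configuration can hold four 24 KB blocks, while 32 KB holds only one).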

Nsight Systems (nsys) has single-pass PM metrics that show the average shared memory allocated, but the configuration is not available at this time for A100. It is only available for graphics-focused cards (GA10x, AD10x, …).

This topic was automatically closed after 13 days. New replies are no longer allowed.