The Ampere docs are somewhat confusing in this regard. On the one hand, they claim that a 0 KB shared memory carveout per kernel is possible (a.k.a. the "max L1" carveout), but at the same time they state that the CUDA runtime reserves 1 KB of shared memory per thread block.
Anecdotally, Nsight analysis of a simple kernel that uses no shared memory shows the runtime reserving 1024 bytes regardless, which suggests the "max L1" carveout may be impossible to achieve in practice.
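To double-check outside of Nsight, here is a minimal query I would expect to confirm the reservation, assuming the cudaDevAttrReservedSharedMemoryPerBlock device attribute (added in CUDA 11) reports the same per-block driver reservation:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Assumption: this attribute reports the shared memory the CUDA
    // driver reserves per thread block, in bytes (expected: 1024).
    int reserved = 0;
    cudaDeviceGetAttribute(&reserved, cudaDevAttrReservedSharedMemoryPerBlock, 0);
    printf("driver-reserved shared memory per block: %d bytes\n", reserved);
    return 0;
}
```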
So what gives? Are there tricks beyond the cudaSharedmemCarveoutMaxL1 hint for achieving zero reserved shared memory?
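For reference, this is roughly how I'm applying the hint (a sketch; the kernel name and launch configuration are placeholders). Note that the carveout attribute is documented as a preference only, so the driver is free to choose a different split:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel that touches no shared memory.
__global__ void noSmemKernel() {}

int main() {
    // Request the maximum L1 split, i.e. a 0 KB shared memory carveout.
    cudaFuncSetAttribute(noSmemKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxL1);
    noSmemKernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Even with this hint applied, Nsight still shows the 1024-byte per-block reservation described above.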