Is a "max L1" carveout on Ampere actually possible?

The Ampere docs are somewhat confusing on this point. On the one hand, they state that a carveout of 0 KB of shared memory per kernel is possible (a.k.a. the "max L1" carveout), but at the same time they state that the runtime reserves 1 KB of shared memory per thread block.

Anecdotally, Nsight analysis of a simple kernel that uses no shared memory suggests the runtime always reserves 1024 bytes regardless, so a true "max L1" carveout may be impossible in practice.

So what gives? Are there tricks beyond the cudaSharedmemCarveoutMaxL1 hint for achieving zero reserved shared memory?
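For context, the hint in question is applied per kernel via `cudaFuncSetAttribute`. A minimal sketch (the kernel name and launch configuration here are illustrative, not from the thread):

```cuda
#include <cuda_runtime.h>

__global__ void noSmemKernel(float *out) {
    // Deliberately uses no shared memory at all.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = static_cast<float>(i);
}

int main() {
    // Request the maximum-L1 carveout (0% shared memory) for this kernel.
    // Note this is a hint: the driver may still choose a different split.
    cudaFuncSetAttribute(noSmemKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxL1);

    float *out;
    cudaMalloc(&out, 256 * sizeof(float));
    noSmemKernel<<<1, 256>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```

Even with this hint applied, profiler occupancy reports still show a 1 KB per-block shared memory reservation.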

The 1 KB of shared memory per thread block is not avoidable.
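This reservation can also be confirmed programmatically: the runtime exposes it as a device attribute (CUDA 11 and later). A quick check, assuming device 0:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int reserved = 0;
    // Shared memory the CUDA driver/runtime reserves per block, in bytes.
    cudaDeviceGetAttribute(&reserved,
                           cudaDevAttrReservedSharedMemoryPerBlock,
                           /*device=*/0);
    // On Ampere this is reported as 1024, matching the Nsight observation.
    printf("Reserved shared memory per block: %d bytes\n", reserved);
    return 0;
}
```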
