Can't launch 1 block with 1024 threads when maximizing shared memory using cudaFuncSetAttribute

I'm hitting a strange error: I can’t launch a block with 1024 threads after maximizing shared memory using

  returnVal = cudaFuncSetAttribute(

to dynamically request the maximum available shared memory.

The launch fails with cudaErrorLaunchOutOfResources (error 701): “too many resources requested for launch”.

The same kernel launches fine with 512 threads.

Maybe the actual problem is a register problem and not a shared memory problem. I think it’s probably not practical to diagnose an issue based on one line of code or the level of information provided so far. OTOH you might get much more useful help if you provided a short, complete test case that demonstrates the issue. Note that asking for a “short, complete test case” is not the same as asking for “your whole code”.

Do as you wish, of course. Just a suggestion.

The global kernel uses 92 registers:
ptxas info : Used 92 registers, 16016 bytes smem, 904 bytes cmem[0], 4 bytes cmem[2]

The global kernel in question calls three device functions:
ptxas info : Used 40 registers, 16016 bytes smem, 512 bytes cmem[0]
ptxas info : Used 40 registers, 16016 bytes smem, 492 bytes cmem[0], 4 bytes cmem[2]
ptxas info : Used 40 registers, 16016 bytes smem, 560 bytes cmem[0]

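Going by the global kernel’s register usage, a 1024-thread block needs 1024 * 92 = 94,208 registers, which is more than 64K (65,536), which appears to be the register limit per SM on Ampere. A 512-thread block needs only 512 * 92 = 47,104 registers, which fits and would explain why that launch works. So this looks like a register problem, not a shared memory problem.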
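
To make the arithmetic concrete, here is a minimal host-side sketch of the register budget check. It assumes a limit of 65,536 32-bit registers per block (on a real device you would read it from cudaDeviceProp::regsPerBlock), and it ignores the hardware's register allocation granularity, so the result is only a lower bound. The function and constant names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 64K (65,536) 32-bit registers are available to one thread block.
// On a real device, query cudaDeviceProp::regsPerBlock instead of hard-coding.
const std::uint32_t kAssumedRegsPerBlock = 64 * 1024;

// Registers one block would need. Real hardware rounds the allocation up,
// so treat this as a lower bound on the actual requirement.
std::uint32_t regs_needed(std::uint32_t regs_per_thread,
                          std::uint32_t threads_per_block) {
    return regs_per_thread * threads_per_block;
}

// True if the block's register demand stays within the assumed budget.
bool fits_register_budget(std::uint32_t regs_per_thread,
                          std::uint32_t threads_per_block) {
    return regs_needed(regs_per_thread, threads_per_block) <= kAssumedRegsPerBlock;
}

// Largest per-thread register count that still fits for a given block size.
std::uint32_t max_regs_per_thread(std::uint32_t threads_per_block) {
    assert(threads_per_block > 0);
    return kAssumedRegsPerBlock / threads_per_block;
}
```

With the numbers from the ptxas output, regs_needed(92, 1024) is 94,208 (over budget), regs_needed(92, 512) is 47,104 (within budget), and max_regs_per_thread(1024) is 64. If that is the cause, the kernel would have to be compiled down to 64 registers per thread or fewer to launch with 1024 threads, for example with `__launch_bounds__(1024)` on the kernel or nvcc's `-maxrregcount=64`, possibly at the cost of register spilling.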