I’m working on compiling a custom CUDA kernel for an attention mechanism on an sm_90a (Hopper) architecture, and I’ve encountered a ptxas warning that has me a bit confused. The compiler is reporting that it’s using local memory, but it also explicitly states that there are no register spills.
My understanding was that local memory is primarily used when the number of live variables exceeds the available registers, forcing a “spill-to-local”. However, that doesn’t seem to be the case here.
Here is the relevant output from the compiler:
ptxas warning : Local memory used for function '_ZN7cutlass13device_kernelIN5flash20...<omitted>...', size of stack frame: 64 bytes
ptxas info : Function properties for _ZN7cutlass13device_kernelIN5flash20...<omitted>...
64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 255 registers, used 16 barriers, 64 bytes cumulative stack size, 512 bytes smem
As you can see, ptxas warns about local memory usage because of a 64-byte stack frame. However, in the detailed info, it clearly reports 0 bytes spill stores and 0 bytes spill loads.
The register usage is already at the per-thread maximum of 255 for this architecture, so register pressure is clearly high.
This leads to my main question:
Under what circumstances does the CUDA compiler create a stack frame and use local memory, even when it reports zero spill stores and zero spill loads?
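To frame the question, my current suspicion is that one such circumstance is a local array indexed with a value that is not a compile-time constant: such an array cannot live in registers, so ptxas demotes it to local memory without any spilling. Here is a minimal, hypothetical sketch of what I mean (this is not my actual kernel, and I haven't confirmed it reproduces the exact warning):

```cuda
// Hypothetical repro sketch: 'scratch' (16 * 4 = 64 bytes, matching my
// stack frame size) would normally be promoted to registers after full
// unrolling, but the runtime-dependent index below forces it into local
// memory -- producing a stack frame with 0 spill stores/loads.
__global__ void dynamic_index_demo(const int* __restrict__ in,
                                   int* __restrict__ out, int idx)
{
    int scratch[16];
    #pragma unroll
    for (int i = 0; i < 16; ++i)
        scratch[i] = in[threadIdx.x * 16 + i];  // constant indices: register-friendly

    // Runtime index: the compiler cannot resolve this to a fixed register,
    // so (I believe) the whole array is placed in local memory instead.
    out[threadIdx.x] = scratch[idx & 15];
}
```

If my understanding is right, compiling this with `nvcc -arch=sm_90a -Xptxas -v` should show a similar "Local memory used" warning with zero spill bytes, but I'd appreciate confirmation whether dynamic indexing, ABI-mandated stack usage around function calls, or something else entirely is the cause in my case.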
I’m trying to optimize my kernel and want to fully understand the root cause of this local memory usage. Any insights into this compiler behavior or pointers to relevant documentation would be greatly appreciated!
Thanks in advance.