Understanding ptxas Local Memory Usage: Why is a Stack Frame created without Register Spills?

I’m working on compiling a custom CUDA kernel for an attention mechanism on an sm_90a (Hopper) architecture, and I’ve encountered a ptxas warning that has me a bit confused. The compiler is reporting that it’s using local memory, but it also explicitly states that there are no register spills.

My understanding was that local memory is primarily used when the number of live variables exceeds the available registers, forcing a “spill-to-local”. However, that doesn’t seem to be the case here.

Here is the relevant output from the compiler:

ptxas warning : Local memory used for function '_ZN7cutlass13device_kernelIN5flash20...<omitted>...', size of stack frame: 64 bytes

ptxas info    : Function properties for _ZN7cutlass13device_kernelIN5flash20...<omitted>...
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 255 registers, used 16 barriers, 64 bytes cumulative stack size, 512 bytes smem

As you can see, ptxas warns about local memory usage because of a 64-byte stack frame. However, in the detailed info, it clearly reports 0 bytes spill stores and 0 bytes spill loads.

The register usage is at the absolute maximum of 255 for this architecture.

This leads to my main question:

Under what circumstances does the CUDA compiler create a stack frame and use local memory, even when it’s not explicitly spilling registers?

I’m trying to optimize my kernel and want to fully understand the root cause of this local memory usage. Any insights into this compiler behavior or pointers to relevant documentation would be greatly appreciated!

Thanks in advance.

The stack is used for other things besides register spills, such as creating a stack frame to facilitate a function call.

Local memory can also be used for automatic (“stack”) variables, for example a per-thread array that the compiler cannot keep in registers.

The stack is a compiler/processor mechanism that lives in the logical local space; it is a per-thread entity.
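As a rough sketch of the function-call case: the kernel below passes the address of a small automatic array to a device function marked `__noinline__`. The names `sum4`, `call_kernel` and `run_call_demo` are made up for illustration. Because the callee receives a pointer, the array generally has to live in addressable (local) memory rather than in registers, so I would expect ptxas to report a small stack frame with 0 bytes of spill stores and loads. The exact numbers depend on the compiler version and flags.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper. __noinline__ asks the compiler for a real call,
// and the pointer argument means the caller's buffer needs an address.
__device__ __noinline__ float sum4(const float* v)
{
    return v[0] + v[1] + v[2] + v[3];
}

__global__ void call_kernel(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float tmp[4];                  // automatic ("stack") array
    for (int k = 0; k < 4; ++k)
        tmp[k] = in[4 * i + k];
    out[i] = sum4(tmp);            // address of tmp is passed to the callee
}

// Host helper: launches one block of 32 threads on all-ones input
// and returns the result computed by thread 0.
float run_call_demo()
{
    const int n = 32;
    float h_in[4 * n];
    for (int k = 0; k < 4 * n; ++k) h_in[k] = 1.0f;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    call_kernel<<<1, n>>>(d_in, d_out);

    float h_out0 = 0.0f;
    cudaMemcpy(&h_out0, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out0;
}
```

Compiling with something like `nvcc -arch=sm_90a -Xptxas -v -c demo.cu` (the file name is arbitrary) prints the same "Function properties" lines as in the question, so the stack frame and spill figures can be compared directly.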

I don’t have an exhaustive list and I don’t think NVIDIA documents an exhaustive list of all the reasons that local memory and/or stack might be used, but general processor principles may be instructive.
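Since there is no exhaustive list, one practical way to check a given kernel is to ask the runtime how much local memory it was compiled with. Here is a minimal sketch using `cudaFuncGetAttributes`; the kernel `scratch_kernel` and the helper `query_local_bytes` are placeholders of my own, not anything from the kernel in the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: a small per-thread array read with a runtime index.
__global__ void scratch_kernel(const int* in, const int* idx, int* out)
{
    int a[64];
    for (int k = 0; k < 64; ++k)
        a[k] = in[k];
    out[threadIdx.x] = a[idx[threadIdx.x] & 63];
}

// Returns the per-thread local memory size in bytes recorded for the
// compiled kernel, or -1 if the query fails.
long long query_local_bytes()
{
    cudaFuncAttributes attr{};
    cudaError_t err = cudaFuncGetAttributes(&attr, scratch_kernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    printf("registers/thread: %d, local bytes/thread: %zu, static smem: %zu\n",
           attr.numRegs, attr.localSizeBytes, attr.sharedSizeBytes);
    return static_cast<long long>(attr.localSizeBytes);
}
```

The `localSizeBytes` field covers local memory as a whole, so it does not by itself distinguish a stack frame from spills; for that split the `-Xptxas -v` output is still the place to look.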

This related post may be of interest.

Thank you for your response! I'm still slightly confused: are using local memory and register spilling equivalent (as I initially thought)? From the compilation output, it seems they are not. Does this mean register spilling leads to local memory usage, but local memory usage doesn't necessarily imply register spilling? If so, could you provide an example of the latter?

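No, they are not equivalent.

Yes: register spilling leads to local memory usage, but local memory usage doesn't necessarily imply register spilling.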

An example of local memory usage could be:

int a[1024];

in device code.
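To make that a bit more concrete, here is a minimal sketch built around that array. The kernel and helper names (`local_array_kernel`, `run_local_array_demo`) are invented for illustration. The array is far too large to hold in registers and is read with an index only known at run time, so I would expect the compiler to place it in local memory and ptxas to report a stack frame of roughly 4096 bytes with no spill stores or loads. Treat those figures as an expectation, not a guarantee.

```cuda
#include <cuda_runtime.h>

__global__ void local_array_kernel(const int* idx, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int a[1024];                   // per-thread array, 4096 bytes
    for (int k = 0; k < 1024; ++k)
        a[k] = k + i;

    out[i] = a[idx[i] & 1023];     // index is only known at run time
}

// Host helper: thread i reads element 7*i of its own array.
// Returns the value produced by thread 5.
int run_local_array_demo()
{
    const int n = 32;
    int h_idx[n];
    for (int i = 0; i < n; ++i) h_idx[i] = 7 * i;

    int* d_idx = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_idx, h_idx, n * sizeof(int), cudaMemcpyHostToDevice);

    local_array_kernel<<<1, n>>>(d_idx, d_out, n);

    int h_out[n];
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_idx);
    cudaFree(d_out);
    return h_out[5];               // a[35] for thread 5, i.e. 35 + 5
}
```

Nothing here is a spill: the registers are not oversubscribed, the array simply needs an addressable home, and local memory is where per-thread addressable storage lives.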
