Differences Between Stack Frame, Spill Stores, and Spill Loads

What are the differences between Stack Frame (bytes), Spill Stores (bytes), and Spill Loads (bytes)?

Definitions here.


Thanks! That's still a bit vague for me, though. As a CUDA programmer, should I treat these three metrics as basically similar? Is larger always worse?

A stack is not a concept that is unique or specific to CUDA. If you want to know what a stack is or why it is used in modern processors, you should be able to find plenty of resources on the web for that. A stack frame is simply the space utilized on the stack to conduct a particular operation, such as a function or subroutine call.

Spill stores and spill loads relate to the usage of variables in the logical local space. Such variables may manifest in a register, in DRAM, or perhaps in the caches. The GPU is, for the most part, a load/store architecture, so when it comes time to use variables in calculations, they almost universally manifest in GPU registers.

A logical local space variable might appear in source code like:

int a;

So if there were no other constraints, the compiler would simply choose to “locate” or “manifest” all logical local space variables in registers. But, unfortunately, there are other constraints. Two of the most relevant are:

  1. there is not an unlimited supply of registers
  2. registers cannot be indexed; that is, I cannot select a register as an operand based on the numerical contents of another register
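The second constraint is the one that most often surprises people. Here is a hypothetical kernel (names invented for illustration) showing how a runtime-dependent index into a local array runs into it:

```cuda
// Hypothetical kernel illustrating constraint 2: registers cannot be
// indexed, so a local array accessed with a runtime-dependent index
// usually cannot live in registers and lands in local memory instead.
__global__ void dynamic_index(const int *in, int *out)
{
    int buf[32];                          // logical local space array
    for (int i = 0; i < 32; ++i)
        buf[i] = in[i];
    // j is only known at run time, so the compiler cannot map buf[] onto
    // fixed registers; expect STL/LDL instructions in the SASS for this.
    int j = in[threadIdx.x] & 31;
    out[threadIdx.x] = buf[j];
}
```

If the index were a compile-time constant (or the loop fully unrolled with constant indices), the compiler could keep `buf` entirely in registers.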

Because of this, the compiler may choose to “locate” or “manifest” logical local space variables at certain points in execution in registers, and at other points in execution in memory (or cache). Since the “location” of a variable may change over the duration of code execution, “movement” of the variable may be necessary.

The SASS instruction that would typically be used to "send" a local variable from register to memory is STL. The SASS instruction that would be used to "send" a local variable from memory to register is LDL. When the compiler chooses to relocate a logical local space variable from a register to memory (with the intent of loading it later, of course), that is referred to as a "spill store", accomplished via STL. When the compiler chooses to relocate a logical local space variable from memory back to a register, that is referred to as a "spill load", accomplished via LDL.
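Schematically (register numbers and stack offsets invented here, not actual compiler output), a spill pair in a SASS dump (e.g. from `cuobjdump -sass`) might look like:

```
STL  [R1+0x8], R4    // spill store: copy register R4 out to local memory
...                  // R4 is reused for other values in between
LDL  R4, [R1+0x8]    // spill load: bring the value back into a register
```

The exact register used as the stack pointer and the instruction formatting vary by GPU architecture.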

In an ideal machine (infinite supply of registers, etc.), we would prefer not to have either of these mechanisms taking place. They "cost" something: each takes an instruction issue slot, and each accesses the memory subsystem in some fashion, with all that that implies. However, in some cases the compiler decides they are "necessary". Like many other performance considerations in a GPU, there is a tradeoff involved. It's usually not sensible to simply take the attitude that "spills are bad", because they may indeed be "necessary". But if you can refactor code to reduce spill loads and spill stores, it may result in increased performance.

Mindless techniques such as using register controls to increase register usage are generally not that effective, in my experience. The compiler generally makes pretty good decisions about register usage. The compiler is ultimately after maximum delivered performance, not any other metric as a first order optimization goal. Therefore if the compiler is generating spill operations, it generally believes that is the best compromise to achieve best performance.

YMMV. Exceptions and bugs are always possible. Experimentation and variation of parameters such as register limits may produce faster code.

You can find many forum articles discussing spills.

The stack frame and e.g. STL instructions may be used for other purposes; I already mentioned one possible use - as a data repository for arguments passed to a function/subroutine call (again, not unique or specific to CUDA or GPUs). Therefore, the simple presence of STL instructions in your SASS dump does not necessarily mean that spill stores are occurring; use the available compiler output to determine that conclusively.
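The metrics in the original question come from that compiler output: compiling with `-Xptxas -v` makes ptxas print per-function statistics. The kernel name and numbers below are invented for illustration; the line format is approximately what recent nvcc versions print:

```
$ nvcc -Xptxas -v -c kernel.cu
ptxas info    : Function properties for _Z8mykernelPf
    40 bytes stack frame, 40 bytes spill stores, 36 bytes spill loads
ptxas info    : Used 64 registers, 356 bytes cmem[0]
```

Nonzero "spill stores"/"spill loads" here is the conclusive indicator of spilling, independent of whether STL/LDL instructions appear in the SASS for other reasons.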


Generally, in computer architectures, a thread's stack (as opposed to stack data structures used elsewhere in a program) is mostly used for storing local variables and the parameters of functions.

The space for local variables grows and shrinks as new scopes (new functions or even new code blocks) are entered or left.

In CUDA most functions are inlined, and the scopes are typically not as deep as on the CPU, as most kernels are compact.

So the stack mostly/effectively provides a flat thread-specific memory area for local variables.

In addition to what @Robert_Crovella nicely explained: the compiler/assembler mostly makes the best decision by itself, but there are many cases where your program forces it to spill, losing performance. Common reasons:

  • Too many/large local variables
  • Too restrictive launch bounds or maxrregcount parameter
  • Dynamic (run-time or thread-specific) indices into local arrays
  • Too much loop unrolling (which can lead to loads being done up front and consuming too many registers; this is rare) or too little (leading to dynamic indices)
  • Function calls copying many parameters or not being inlined

In those cases, the register spill statistics are a good indicator of what to hunt down; fixing the cause often makes the program much faster.


"Too restrictive launch bounds or maxrregcount parameter" This is an interesting point. You mean, if the actual register usage is 10, but we limit it to 8, then 2 registers' worth of data will be spilled to local memory, right? But the reality seems not that clear… like here:

If the register limit is not set correctly, we get a CUDA error. The registers are not smoothly and automatically spilled.

For further discussion, I put the above content in another post:

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.