Differences Between Stack Frame, Spill Stores, and Spill Loads

What are the differences between Stack Frame (bytes), Spill Stores (bytes), and Spill Loads (bytes)?

Definitions here.


Thanks! That's still a bit vague for me, though. As a CUDA programmer, should I treat these three metrics as basically similar? Is larger always worse?

A stack is not a concept that is unique or specific to CUDA. If you want to know what a stack is or why it is used in modern processors, you should be able to find plenty of resources on the web for that. A stack frame is simply the space utilized on the stack to conduct a particular operation, such as a function or subroutine call.

Spill stores and spill loads relate to the usage of variables in the logical local space. Such variables may manifest in a register, in DRAM, or perhaps in the caches. The GPU is, for the most part, a load/store architecture, so when it comes time to use variables in calculations, they almost universally manifest in GPU registers.

A logical local space variable might appear in source code like:

int a;

So if there were no other constraints, the compiler would simply choose to “locate” or “manifest” all logical local space variables in registers. But, unfortunately, there are other constraints. Two of the most relevant are:

  1. there is not an unlimited supply of registers
  2. registers cannot be indexed; that is, I cannot select a register as an operand based on the numerical contents of another register
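The second constraint is the one that most often surprises people. Here is a hypothetical kernel (names invented for illustration) showing how a runtime-dependent index into a local array runs into it:

```cuda
// Hypothetical kernel illustrating constraint 2: registers cannot be
// indexed, so a local array accessed with a runtime-dependent index
// usually cannot live in registers and lands in local memory instead.
__global__ void dynamic_index(const int *in, int *out)
{
    int buf[32];                          // logical local space array
    for (int i = 0; i < 32; ++i)
        buf[i] = in[i];
    // j is only known at run time, so the compiler cannot map buf[] onto
    // fixed registers; expect STL/LDL instructions in the SASS for this.
    int j = in[threadIdx.x] & 31;
    out[threadIdx.x] = buf[j];
}
```

If the index were a compile-time constant (or the loop fully unrolled with constant indices), the compiler could keep `buf` entirely in registers.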

Because of this, the compiler may choose to “locate” or “manifest” logical local space variables at certain points in execution in registers, and at other points in execution in memory (or cache). Since the “location” of a variable may change over the duration of code execution, “movement” of the variable may be necessary.

The SASS instruction that would typically be used to "send" a local variable from register to memory is STL. The SASS instruction that would be used to "send" a local variable from memory to register is LDL. When the compiler chooses to relocate a logical local space variable from a register to memory (with the intent of loading it later, of course), that is referred to as a "spill store", accomplished via STL. When the compiler chooses to relocate a logical local space variable from memory back to a register, that is referred to as a "spill load", accomplished via LDL.
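Schematically (register numbers and stack offsets invented here, not actual compiler output), a spill pair in a SASS dump (e.g. from `cuobjdump -sass`) might look like:

```
STL  [R1+0x8], R4    // spill store: copy register R4 out to local memory
...                  // R4 is reused for other values in between
LDL  R4, [R1+0x8]    // spill load: bring the value back into a register
```

The exact register used as the stack pointer and the instruction formatting vary by GPU architecture.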

In an ideal machine (infinite supply of registers, etc.), we would prefer not to have either of these mechanisms taking place. They "cost" something: each takes an instruction issue slot, and each accesses the memory subsystem in some fashion, with all that that implies. However, in some cases the compiler decides they are "necessary". Like many other performance considerations in a GPU, there is a tradeoff involved. It's usually not sensible to simply take the attitude that "spills are bad", because they may indeed be "necessary". But if you can refactor code to reduce spill loads and spill stores, it may result in increased performance.

Mindless techniques such as using register controls to increase register usage are generally not that effective, in my experience. The compiler generally makes pretty good decisions about register usage. The compiler is ultimately after maximum delivered performance, not any other metric as a first order optimization goal. Therefore if the compiler is generating spill operations, it generally believes that is the best compromise to achieve best performance.

YMMV. Exceptions and bugs are always possible. Experimentation and variation of parameters such as register limits may produce faster code.

You can find many forum articles discussing spills.

The stack frame and e.g. STL instructions may be used for other purposes; I already mentioned one possible use - as a data repository for arguments passed to a function/subroutine call (again, not unique or specific to CUDA or GPUs). Therefore, the simple presence of STL instructions in your SASS dump does not necessarily mean that spill stores are occurring; use the available compiler output to determine that conclusively.
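The metrics in the original question come from that compiler output: compiling with `-Xptxas -v` makes ptxas print per-function statistics. The kernel name and numbers below are invented for illustration; the line format is approximately what recent nvcc versions print:

```
$ nvcc -Xptxas -v -c kernel.cu
ptxas info    : Function properties for _Z8mykernelPf
    40 bytes stack frame, 40 bytes spill stores, 36 bytes spill loads
ptxas info    : Used 64 registers, 356 bytes cmem[0]
```

Nonzero "spill stores"/"spill loads" here is the conclusive indicator of spilling, independent of whether STL/LDL instructions appear in the SASS for other reasons.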


Generally, in computer architectures, a thread's stack (as opposed to stack data structures used elsewhere in a program) is mostly used for storing local variables and the parameters of functions.

The space for local variables grows and shrinks as new scopes (new functions or even new code blocks) are entered or left.

In CUDA most functions are inlined, and the scopes are typically not as deep as on the CPU, as most kernels are compact.

So the stack mostly/effectively provides a flat thread-specific memory area for local variables.

In addition to what @Robert_Crovella nicely explained: the compiler/assembler mostly makes the best decision by itself, but there are many cases where your program forces it to spill, losing performance. Common reasons:

  • Too many/large local variables
  • Too restrictive launch bounds or maxrregcount parameter
  • Dynamic (run-time or thread-specific) indices into local arrays
  • Too much loop unrolling (which can lead to loads being done up front and consuming too many registers; this is rare) or too little (leading to dynamic indices)
  • Function calls copying many parameters or not being inlined

In those cases, the register spill statistics are a good indicator of what to hunt down; fixing the cause often makes the program much faster.


"Too restrictive launch bounds or maxrregcount parameter" This is an interesting point. You mean, if the actual register usage is 10, but we limit it to 8, then 2 registers' worth of data will be spilled to local memory, right? But the reality seems not that clear… like here:

If the register limit is not set correctly, we get a CUDA error. The registers are not smoothly and automatically spilled.

For further discussion, I put the above content in another post:

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.