A basic question about nvcc: stack frame, spilling ld/st


May I have your kind help in understanding the following situation?

The stack frame size is really large (for example, several thousand bytes), but the spill counts for ld/st are really small (for example, less than 100 each).

Thank you.


With CUDA’s ABI, the stack frame is used in much the same way as it is with ABIs for CPUs. Function arguments and thread-local variables are stored there, and it serves as a backing store for spilled registers. The stack frame is allocated from local memory. If your stack frame is extremely large, it may exceed the default local memory allocation, in which case you may need to increase the allocation (I don’t quite remember the relevant CUDA API call at the moment).

Not knowing anything about your code, it is hard to guess what is occupying the stack frame. Does the code use a sizeable local array, or a very complex C++ class maybe?

There is a large integer array a[several KB], and each element is accessed. If this data is supposed to be in lmem, I would like to understand why the spill count is small.

Local memory use (and stack frame use in particular) is not the same as register spilling. Providing storage for register spilling is just one of the uses of the stack frame and thus local memory.

Local memory is used to store thread-local data. All thread-local data is by default allocated in local memory. As an optimization, the compiler will attempt to place some of that data in registers instead. However, registers are a tightly limited resource and cannot be indexed. To be placed into registers, an array has to be both small (as determined by a compiler heuristic) and accessed exclusively via compile-time constant indices. Local arrays that are large (like the one in your code), or that use runtime-variable indexing, must remain in local memory.
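As a sketch of this distinction (kernel and variable names are hypothetical, not from the code under discussion), a kernel like the following could be compiled with nvcc -Xptxas -v to compare the reported stack frame against the spill counts:

// Hypothetical illustration; compile with: nvcc -Xptxas -v example.cu
__global__ void demo(const int *in, int *out, int idx)
{
    int small[4];     // small, constant-indexed below: candidate for registers
    int big[2048];    // 8 KB, runtime-indexed below: stays in local memory

    small[0] = in[threadIdx.x];   // all accesses to small[] use literal indices
    small[1] = small[0] * 2;
    small[2] = small[1] + small[0];
    small[3] = small[2] - 1;

    for (int i = 0; i < 2048; ++i)
        big[i] = in[i] + small[3];   // big[i]: runtime index forces local memory

    out[threadIdx.x] = big[idx];     // idx is runtime-variable, too
}

Here the 8 KB array alone would account for a stack frame of several thousand bytes, while spill loads/stores can stay near zero, which matches the situation described in the original question.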

In some cases the compiler makes a decision to place a local variable into a register, only to find later that there is insufficient register storage to hold all the variables. In those cases it will temporarily unload the register data for some of these variables to local memory, and reload it from there later. This process is called register spilling, and is a technique commonly used by compilers on both CPUs and GPUs.

Oh, I see. So the array might have nothing to do with spilling; the compiler could just put it into lmem instead of using any registers. Thank you. My original understanding was that all CUDA thread-local data (auto variables) by default resides in registers… That was the misunderstanding.

Since the compiler usually pulls scalar variables into registers right away, one could easily come to that conclusion through casual observation. The actual allocation logic is complex and involves various heuristics (at least some of which are architecture dependent). My explanation above should be construed as a simple but useful mental model of what is happening under the hood.

The basic message is this: not all local memory use is due to register spilling; therefore, the stack frame may need to be much larger than what is required for register spilling alone.
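This difference is visible directly in the resource report printed by ptxas when compiling with nvcc -Xptxas -v. The function name and numbers below are made up for illustration, but the report has this shape: a stack frame of several KB can coexist with tiny (or zero) spill byte counts:

ptxas info    : Function properties for _Z3fooPii
    8192 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 18 registers, 352 bytes cmem[0]

The "bytes stack frame" figure covers all local memory use (local arrays, spilled function arguments, and so on), while "spill stores" and "spill loads" count only the traffic generated by register spilling.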

Thank you very much, I see.


As a followup to #2, the CUDA API functions I was thinking of are these:

size_t stackSize, newStackSize = 4096;                // e.g. raise the limit to 4 KB per thread
cudaThreadGetLimit(&stackSize, cudaLimitStackSize);   // get stack size
cudaThreadSetLimit(cudaLimitStackSize, newStackSize); // set stack size
// Note: the cudaThread* variants are deprecated; current CUDA versions
// use cudaDeviceGetLimit / cudaDeviceSetLimit with the same arguments.