Why does the compiler use a stack frame instead of registers?

When I compile some kernels, the compiler seems to prefer a stack frame (which I guess goes into slow local memory) over registers. Why is that? I have __forceinline__ on all my helper functions.

For the following kernel, why can’t the compiler use another 4 float registers instead of using 16 bytes of stack frame?

ptxas info : Function properties for convolutionColumns
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 16384 bytes smem, 372 bytes cmem[0]

It is impossible to tell without additional information. Note that the mere presence of a stack frame does not indicate a problem, performance or otherwise. Use of the stack frame is generally related to the use of a proper ABI (since CUDA 3.0, I think). Use of an ABI is a prerequisite for device-side printf(), various C++ features, etc. The compiler may or may not be able to optimize away the need for a stack frame.

For an experiment, you can try to build with -Xptxas -abi=no. Note that I would not recommend this for production builds as it may become deprecated.

I do not have a good understanding of the conditions under which the compiler is able to remove the stack frame (it has never come up as an issue in my work). Experimentally, I observe that functions that call __noinline__ device functions appear to use a non-zero stack frame. For example, the double-precision trig functions sin(), cos(), sincos(), and tan() implement their slow path as a called, non-inlined subroutine, and functions that invoke these DP trig functions appear to wind up with a non-zero stack frame.
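If you want to reproduce that observation yourself, here is a minimal sketch, not taken from your code. All names in it (scale_called, scale_inlined, kernel_called, kernel_inlined, run_scale, demo.cu) are made up for illustration. It builds the same trivial kernel twice, once calling a helper that is forced to stay a real call and once with the helper forced inline, so you can compare the two "Function properties" lines that ptxas prints. Whether the called variant actually reports a non-zero stack frame depends on the compiler version and target architecture, so treat it as an experiment rather than a guaranteed result.

```cuda
// Build with verbose ptxas output to see per-kernel stack frame and register use:
//   nvcc -Xptxas -v -c demo.cu        (demo.cu is a hypothetical file name)
// To check whether a call instruction survived, inspect the machine code:
//   cuobjdump -sass demo.o
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper that the compiler is asked to keep as a real call.
__device__ __noinline__ float scale_called(float x, float k) { return x * k; }

// Same helper, but forced inline into the caller.
__device__ __forceinline__ float scale_inlined(float x, float k) { return x * k; }

// Kernel that goes through the non-inlined helper.
__global__ void kernel_called(const float* in, float* out, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = scale_called(in[i], k);
}

// Kernel that uses the inlined helper.
__global__ void kernel_inlined(const float* in, float* out, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = scale_inlined(in[i], k);
}

// Host helper: runs one of the two variants and returns the results.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_scale(const std::vector<float>& in, float k, bool use_call)
{
    int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    if (use_call)
        kernel_called<<<blocks, threads>>>(d_in, d_out, n, k);
    else
        kernel_inlined<<<blocks, threads>>>(d_in, d_out, n, k);

    // This copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Both variants compute the same values, so any difference in the ptxas report comes from how the helper was compiled, not from the arithmetic. Call run_scale from your own main() if you also want to time the two versions.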

The compiler should give an advisory message if it cannot honor the __forceinline__ attribute. Are you seeing such messages? You can also inspect the generated code to see whether inlining took place.