Ideas on reducing stack frame?

I have a kernel with nested device functions.
Some of the functions have many int* parameters.
The kernel is slow:-(
At present --ptxas-options=-v reports “64 bytes stack frame”.
Does anyone know of compiler options (or other ideas) for
reducing this (ideally to zero)?
The kernel seems to have registers to spare.

Should I use/not use inline in the source code?
Should I use no-inline on the nvcc command line?

As always comments and suggestions would be most welcome.
Many thanks

ps: it appears the stack frame is held in thread-local memory,
which is why I suspect it is (part of) the slowness.

Yes, the stack frame for functions is allocated in local memory. Realistically, that is the only place it can go. Since there can be divergent execution flow across function calls, the storage has to be per-thread which means the combined memory for all stack frames is fairly large.

Can you show the code and how you are compiling it (= exact nvcc command line) as well as the full output from -Xptxas -v? The use of non-empty stack frames could be due to function argument passing (for example when you choose separate compilation), register spilling, or arrays local to the function(s) that are either “large” or use variable indexing.
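To illustrate the "variable indexing" case mentioned above, here is a minimal sketch. The function names and the DEVICE_FN macro are made up for illustration and are not from the original poster's code. The macro lets the same source build with nvcc or with a plain C++ compiler, so the logic can be checked on the host. Whether ptxas actually reports a stack frame for variant A depends on the architecture and compiler version, so check with -Xptxas -v.

```cpp
#ifdef __CUDACC__
#define DEVICE_FN __host__ __device__
#else
#define DEVICE_FN
#endif

// Hypothetical variant A: a per-thread array read with a runtime index.
// Registers cannot be indexed dynamically, so the compiler may place
// `buf` in local memory, which would show up as a stack frame.
DEVICE_FN inline int square_at_dynamic(const int* in, int k)
{
    int buf[8];
    for (int i = 0; i < 8; ++i)
        buf[i] = in[i] * in[i];
    return buf[k & 7];          // index only known at run time
}

// Hypothetical variant B: same result without any per-thread array.
DEVICE_FN inline int square_at_direct(const int* in, int k)
{
    const int v = in[k & 7];
    return v * v;
}
```

Both variants return the same value; comparing the -Xptxas -v output for kernels that call one or the other is a quick way to see whether such an array is the source of the stack frame.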

While access to local memory (a per-thread mapping of a portion of global memory) is slow, the accesses are cached in L1 and with 64 bytes per thread it just may fit (mostly) into the cache. So this may not be as much of a performance killer as you suspect and there may be other, more important bottlenecks that cause the kernel to be slow. The CUDA profiler can extract relevant metrics and provide guidance.

If you simply wish to experiment based on the hypothesis that the issue is function argument passing, you could try to use whole program compilation (all code for a kernel is part of a single compilation unit), and in addition use the __forceinline__ attribute on all functions. That could easily backfire if it increases register usage to the point of introducing register spilling, which itself leads to local memory accesses.
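A minimal sketch of that experiment, using CUDA's `__forceinline__` qualifier on nested helpers that take several int* parameters. The function names, the kernel, and the DEVICE_FN macro are hypothetical and not the original poster's code. Under a plain C++ compiler the macro falls back to `inline`, so the helpers can be tested on the host.

```cpp
#ifdef __CUDACC__
#define DEVICE_FN __host__ __device__ __forceinline__
#else
#define DEVICE_FN inline
#endif

// Hypothetical inner helper with several int* parameters.
DEVICE_FN int weighted_sum(const int* a, const int* b, const int* w, int i)
{
    return w[i] * (a[i] + b[i]);
}

// Hypothetical nested caller. With forced inlining the intent is that no
// real call remains, so no arguments need to be passed via a stack frame.
DEVICE_FN void accumulate(const int* a, const int* b, const int* w,
                          int* out, int n)
{
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc += weighted_sum(a, b, w, i);
    *out = acc;
}

#ifdef __CUDACC__
// Hypothetical kernel: a single thread runs the nested helpers.
__global__ void accumulate_kernel(const int* a, const int* b, const int* w,
                                  int* out, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        accumulate(a, b, w, out, n);
}
#endif
```

Compare the -Xptxas -v output with and without the qualifier: the stack frame may go away, but watch the register count and any reported spill stores and loads, for the reason given above.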

[Later:] Are you using trig functions in your code by any chance? The slowpath for the argument reduction in those functions uses a local array to minimize register usage. Unless you pass very large arguments to the trig functions, that small array is never accessed. This is mentioned in the CUDA C Programming Guide (for CUDA 6.5, on page 85 printed).

@wlangdon, a couple additions…

You can force ptxas to not use an ABI at all with the “-Xptxas=-abi=no” option.

I’ve found the no-ABI option helpful in diagnosing the total register footprint of a “flattened” kernel. If you see local spills with this option then, most likely, you don’t have enough registers, your launch bounds are too tight, or you’re performing a random-access on registers. If it’s the last case, you should fix your code or use shared memory… or just let local memory handle it.

You note that you have nested functions. Another thing to check is that your function prototypes aren’t incorrect (or imprecise) and are forcing registers to mistakenly be copied by value.

Oh, and make sure you don’t have some printf()'s hiding in your kernel! :)
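One way to keep stray debug output from sneaking into a release build is to route it through a macro that compiles to nothing by default. This is only a sketch: KERNEL_DEBUG, DBG_PRINTF, DEVICE_FN and clamp_index are made-up names, not part of CUDA or of the original code.

```cpp
#include <cstdio>

// Debug output is compiled in only when KERNEL_DEBUG is defined,
// for example by passing -DKERNEL_DEBUG on the nvcc command line.
#ifdef KERNEL_DEBUG
#define DBG_PRINTF(...) printf(__VA_ARGS__)
#else
#define DBG_PRINTF(...) ((void)0)
#endif

#ifdef __CUDACC__
#define DEVICE_FN __host__ __device__
#else
#define DEVICE_FN
#endif

// Hypothetical helper containing a debug print.
DEVICE_FN inline int clamp_index(int i, int n)
{
    DBG_PRINTF("clamp_index(%d, %d)\n", i, n);
    if (i < 0) return 0;
    if (i >= n) return n - 1;
    return i;
}
```

Without KERNEL_DEBUG the printf call is not present in the compiled code at all, so it cannot contribute to the stack frame or to register usage.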

Dear allanmac and njuffa,
Just a quick reply to report progress so far.

You were right: I had forgotten that I had left some printf() calls in the code (for debugging).
They were not being used.
Nevertheless, removing the 3 printf calls and using -abi=no causes the “stack frame”
to disappear from nvcc’s output and the number of registers to fall.

Also, as you suggested, this did not make a huge difference to performance
(trying -dlcm=cg next)

Many thanks.

ps: As reported in
ABI is needed for printf in a kernel. Before printf was removed nvcc failed with:

ptxas warning : 'option -abi=no' might get deprecated in future
ptxas fatal : Unresolved extern function 'vprintf'

The ABI is needed for many modern CUDA features. There is a reason ABIs exist and are mandatory for all CPUs I have used in the past twenty years, the same reasons apply to the GPU. There was no ABI on sm_1x devices because this old hardware had too many limitations. Other than for a quick experiment, one would not want to turn off use of the ABI. Personally I hope that -abi=no will be discontinued sooner rather than later, it has been deprecated long enough.

@njuffa: Yup, note that I consider the “no ABI” switch as a diagnostic tool. The option is also useful for estimating if a kernel and its algorithms will fit in an abstract and possibly more primitive GPU.

In my 4+ years of GPU compute development I’ve found that GPU compilers ship with bugs. Lots of them. Trying to understand if the bug is my fault or the compiler’s is a big time suck with no upside. Finding a workaround is costly and pretty much always something you have to do yourself (NVIDIA has never provided me a workaround for anything despite reporting dozens of bugs). Don’t forget that your customers often have to wait 6-12 months for a fix to appear even if the bug is fixed internally. My hope is that the ABI switch and other helpful switches stick around for a little while longer.

@wlangdon, if you’re on Windows you really should look at NSight’s “Performance Analysis Experiments”. The source-level experiments are incredibly useful. Being able to sort your memory transaction, branch and instruction statistics lets you zero in on your hot spots. Give it a try if you’re on Windows!

As a former employee of NVIDIA, I am just a customer, too :-)

As a software developer I am aware that the more modes are added to a piece of software, the poorer test coverage tends to get overall since there are limited test resources. Rarely used modes may not even be tested at all. For that reason I am always in favor of removing modes (or other features) that no longer serve much of a useful purpose. My experience tells me that the long-term negative effects of accumulating cruft in a code base over time periods of 10, 15, or 20 years can be quite severe. I apply this philosophy to all software: A component that is not there can’t be broken and can’t break anything else.

As a long-time software developer I am also well aware of the pain of compiler bugs, as I have used quite a few different ones over the years. I used to report compiler bugs to both commercial vendors and non-commercial projects, starting in the late 1980s. Invariably the suggestion of workarounds by the compiler vendor/project was a rare event, and time-to-fix averaged about 1.5 years, regardless of whether the compiler was commercial or non-commercial. So I think what you are experiencing with the CUDA tool chain is par for the course.

Many of those compilers have gone the way of the dodo, for quality reasons or otherwise. The ones that still exist today tend to have very low bug rates, as best I can tell. But this is the status after maturing for (well) over 20 years. I do not know whether there is a better (faster) way to build robust tool chains, as I have never worked as a compiler engineer; instead I frequently interfaced with compiler engineers in the process of contributing optimizations. In practical terms, reporting bugs as they are encountered seems to be the best practice to follow.

Sorry guys, I guess I should have started with this question:
what does ABI stand for?

ABI = application binary interface. Rules that regulate how data is stored, how data is passed to functions and how functions return results, etc. The write-up on this in Wikipedia looks reasonable:

The people on this forum will forever be your customers! :)