Kernel WORKS in Release mode, "too many resources requested for launch" in Debug mode

I have a large kernel with many input arguments and lots of processing in each thread. When I run this program in release mode (flags -O3), everything works fine, but when I run it in debug mode (flags -g -G), I get a “too many resources requested for launch” error. I am sure that I am not exceeding grid/block dimensions or shared memory limits, so I am quite sure this error is a result of register overflow.

What I don’t understand is why this only happens in debug mode and is fine in release mode. I am using CUDA-GDB within Visual Studio Code on Centos7.

Unfortunately I cannot post the code since it is classified.

Hi @will.ransom, thank you for your report!

but when I run it in debug mode (flags -g -G), I get a “too many resources requested for launch” error.

Do you see this error when you run the debug application without the debugger? Or only when the application is debugged with cuda-gdb?

so I am quite sure this error is a result of register overflow.

This is very likely the case. For release builds, compiler implements a number of optimizations, which might affect register pressure (number of live registers during program execution). E.g. register, occupied by a variable, which is no longer referenced, can be reused when program is built in release mode.

@AKravets I get this error both with and without the debugger. It happens no matter what if I compile the program without optimization.

So if this is indeed a register issue and the reason it works in release mode is because the optimization reduces register usage, is the only solution for running in debug mode to reduce the register usage of the kernel? What are the best ways to do this? Should I aim to reduce the number of input arguments, the size of the input arguments, the local memory used per thread, the number of computations per thread, or all the above?

I do not want to reduce the number of threads per block if possible since this will increase the number of times I will have to do global → shared memory copies.

Thanks for the reply!

This question might be better suited for CUDA Programming and Performance - NVIDIA Developer Forums forum branch - I have moved it there.

resource utilization (including registers used) may vary between release and debug modes.

You can inspect this yourself using -Xptxas=-v added to the nvcc compile command line.

The only direct methods I am aware of are either the -maxrregcount switch provided for nvcc which globally affects all kernels compiled by that compile command, or else the __launch_bounds__ methodology. There are many forum questions on these topics and both are documented.

-maxrregcount documentation

__launch_bounds__ documentation

If -Xptxas -v indicates a lot of spilling or high lmem usage in debug builds that is not there i release builds, I wonder whether lack of local memory could be the issue rather than running out of registers. Thread-local data “lives” in local memory by default. As an optimization the compiler tries to buffer it in registers as much and as long as possible. But observation indicates that debug builds seem to disable pretty much all optimizations.

The “too many registers used” hypothesis may be testable in slightly more detail in conjunction with the -Xptxas -v flag. In release builds, when the ptxas optimization level is reduced step by step, does the same issue appear? That is, lowering from the default-Xptxas -O3 down to -Xptxas -O0. My observation is that debug builds use something worse than -O0, i.e. something that looks like “active pessimization”, presumably to make all variables easily accessible in a debugger.

Does the code contain many common sub-expressions that do not get eliminated by CSE (common subexpression elimination) in debug builds, because in these all optimizations are turned off? This might be caused by using numerous or nested macro expansions, for example. In that case source-level elimination of common subexpressions and manual creation of temporary variables may help.

I am not sure how the compiler handles function inlining in debug builds vs release builds, but that can also influence register usage. It might be interesting to observe the register pressure changes from adding __noinline__ and __forceinline__ attributes to some key functions. Maybe there is a function that, when inlined, really drives up register pressure, but where the positive performance benefit of inlining is small in release builds.

The gist of what I am saying is that there may be relatively simple source code transformations that do not reduce the performance and maintainability of the code, but help with the debug build issue. Without seeing the code it is hard to come up with specific recommendations of what to try, so experimenting with changes while keeping an eye on the output from -Xptxas -v would be the way forward.

1 Like

I agree, and don’t know the answer. There is not enough info here to do much diagnosis, but I probably should not have “shrunk the focus” onto just register usage.