I have a large kernel with many input arguments and lots of processing in each thread. When I run this program in release mode (flags -O3), everything works fine, but when I run it in debug mode (flags -g -G), I get a “too many resources requested for launch” error. I am sure that I am not exceeding grid/block dimensions or shared memory limits, so I am quite sure this error is a result of register overflow.
What I don’t understand is why this only happens in debug mode and is fine in release mode. I am using CUDA-GDB within Visual Studio Code on CentOS 7.
Unfortunately I cannot post the code since it is classified.
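The best I can do is a stripped-down sketch of how the failure shows up; every name below is a placeholder and none of it is the actual code.

```
#include <cstdio>

__global__ void bigKernel(const float *in, float *out)   // stand-in for the real kernel
{
    out[threadIdx.x] = in[threadIdx.x];
}

void launchIt(const float *d_in, float *d_out)
{
    bigKernel<<<1024, 256>>>(d_in, d_out);
    cudaError_t err = cudaGetLastError();   // launch-configuration errors are reported here
    if (err != cudaSuccess)
        // in the -g -G build this prints "too many resources requested for launch"
        std::printf("launch failed: %s\n", cudaGetErrorString(err));
}
```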
but when I run it in debug mode (flags -g -G), I get a “too many resources requested for launch” error.
Do you see this error when you run the debug application without the debugger? Or only when the application is debugged with cuda-gdb?
so I am quite sure this error is a result of register overflow.
This is very likely the case. For release builds, the compiler performs a number of optimizations that can reduce register pressure (the number of live registers during program execution). For example, the register occupied by a variable that is no longer referenced can be reused when the program is built in release mode.
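A contrived sketch of the kind of thing I mean (not your code, just an illustration of register reuse):

```
__global__ void sketch(const float *in, float *out)
{
    float a = in[threadIdx.x] * 2.0f;   // 'a' is never referenced after the next line
    float b = a + 1.0f;
    float c = b * b;    // in an -O3 build the register that held 'a' can be reused for 'c';
                        // with -G each source variable tends to keep its own storage so it
                        // stays inspectable in the debugger, which raises resource usage
    out[threadIdx.x] = c;
}
```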
@AKravets I get this error both with and without the debugger. It happens no matter what, as long as I compile the program without optimization.
So if this is indeed a register issue and the reason it works in release mode is that the optimization reduces register usage, is the only solution for running in debug mode to reduce the register usage of the kernel? What are the best ways to do this? Should I aim to reduce the number of input arguments, the size of the input arguments, the local memory used per thread, the number of computations per thread, or all of the above?
I do not want to reduce the number of threads per block if possible, since that would increase the number of global → shared memory copies I have to do.
Resource utilization (including register usage) may vary between release and debug builds.
You can inspect this yourself by adding -Xptxas=-v to the nvcc compile command line.
The only direct methods I am aware of are either the -maxrregcount switch provided by nvcc, which globally affects all kernels compiled by that compile command, or else the __launch_bounds__ methodology, which applies per kernel. There are many forum questions on these topics, and both are documented.
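A sketch of both, with made-up numbers; the right register cap and launch bounds depend entirely on your kernel and GPU:

```
// Hypothetical compile commands (file name is a placeholder):
//   nvcc -O3 -Xptxas=-v -c kernel.cu          // print per-kernel register/lmem/smem usage
//   nvcc -O3 -maxrregcount=64 -c kernel.cu    // cap registers for ALL kernels in this compile

// __launch_bounds__ constrains only the kernel it is attached to:
__global__ void
__launch_bounds__(256, 2)   // max 256 threads per block, at least 2 blocks per SM
bigKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;
}
```

Keep an eye on the -Xptxas=-v output afterwards; if a cap forces heavy spilling to local memory, you may just be trading one problem for another.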
If -Xptxas -v indicates a lot of spilling or high lmem usage in debug builds that is not there in release builds, I wonder whether running out of local memory could be the issue rather than running out of registers. Thread-local data “lives” in local memory by default. As an optimization, the compiler tries to buffer it in registers as much and as long as possible. But observation indicates that debug builds disable pretty much all optimizations.
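As a contrived illustration of where local memory use comes from (again, not your code):

```
__global__ void lmemSketch(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float buf[32];                  // per-thread array: dynamically indexed below,
                                    // so it stays in local memory even in release builds
    for (int i = 0; i < 32; ++i)
        buf[i] = in[tid * 32 + i];

    float acc = 0.0f;               // scalar: promoted to a register with -O3,
                                    // but typically kept in local memory with -G
    for (int i = 0; i < (n & 31); ++i)   // runtime-dependent trip count
        acc += buf[i];

    out[tid] = acc;
}
```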
The “too many registers used” hypothesis may be testable in slightly more detail in conjunction with the -Xptxas -v flag. In release builds, does the same issue appear when the ptxas optimization level is reduced step by step, that is, when lowering from the default -Xptxas -O3 down to -Xptxas -O0? My observation is that debug builds use something worse than -O0, i.e. something that looks like “active pessimization”, presumably to make all variables easily accessible in a debugger.
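Concretely, something along these lines (file name and flag combinations are just an example):

```
//   nvcc -O3 -Xptxas -O3 -Xptxas -v -c kernel.cu   // ptxas default for release builds
//   nvcc -O3 -Xptxas -O2 -Xptxas -v -c kernel.cu
//   nvcc -O3 -Xptxas -O1 -Xptxas -v -c kernel.cu
//   nvcc -O3 -Xptxas -O0 -Xptxas -v -c kernel.cu   // lowest ptxas optimization level
//   nvcc -g -G -Xptxas -v -c kernel.cu             // debug build, for comparison
// Compare the "Used N registers" and spill store/load lines reported for each variant.
```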
Does the code contain many common subexpressions that do not get eliminated by CSE (common subexpression elimination) in debug builds because all optimizations are turned off there? This might be caused by numerous or nested macro expansions, for example. In that case, source-level elimination of common subexpressions via manually created temporary variables may help.
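For example (made-up code), a manual temporary performs the elimination at source level, so even a -G build computes the shared product only once:

```
__global__ void cseSketch(const float *a, const float *b, const float *c,
                          float *out, float *err, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // original form relies on the compiler spotting "a[i] * b[i]" twice:
        //   out[i] = a[i] * b[i] + c[i];
        //   err[i] = a[i] * b[i] - c[i];
        float ab = a[i] * b[i];   // common subexpression hoisted by hand
        out[i] = ab + c[i];
        err[i] = ab - c[i];
    }
}
```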
I am not sure how the compiler handles function inlining in debug builds versus release builds, but that can also influence register usage. It might be interesting to observe how register pressure changes when __noinline__ and __forceinline__ attributes are added to some key functions. Maybe there is a function that, when inlined, really drives up register pressure, but whose performance benefit from inlining is small in release builds.
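Along the lines of this sketch (hypothetical helpers, not your code):

```
// helper whose inlined body would drive up register pressure in the caller
__device__ __noinline__ float heavyHelper(float x)
{
    float t = x * x + 1.0f;       // imagine many more temporaries here
    return t * 0.5f;
}

// trivial helper; inlining it costs essentially nothing
__device__ __forceinline__ float cheapHelper(float x)
{
    return 2.0f * x;
}

__global__ void inlineSketch(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = cheapHelper(heavyHelper(in[i]));
}
```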
The gist of what I am saying is that there may be relatively simple source-code transformations that do not hurt the performance or maintainability of the code but help with the debug-build issue. Without seeing the code it is hard to come up with specific recommendations of what to try, so experimenting with changes while keeping an eye on the output from -Xptxas -v would be the way forward.
I agree, and don’t know the answer. There is not enough info here to do much diagnosis, but I probably should not have “shrunk the focus” onto just register usage.