Weird CUDA problem: changing += to /= in a loop causes a variable not to be set

I have a CUDA kernel that breaks with the error:

too many resources requested for launch

when I run it with “/=” in the loop in that kernel. If I change the “/=” to “+=” then the kernel runs fine.

Please see the attached file for a test case. (5.1 KB)

The code tries to launch 160 x 6 x 2 threads (xthreads x ythreads x blocks). If I reduce the 160 to 128 (but no higher), it runs fine.

Compile it with:

/usr/local/cuda-10.2/bin/nvcc -gencode arch=compute_53,code=sm_53

Thanks. I’ve been looking at this for 4 days and cannot see why it is not working.
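A quick way to see the register usage behind this error is nvcc's ptxas verbose output (-Xptxas -v); assuming the attached test case is saved as test.cu:

```shell
/usr/local/cuda-10.2/bin/nvcc -gencode arch=compute_53,code=sm_53 -Xptxas -v test.cu
```

ptxas then prints a "Used N registers" line per kernel; N multiplied by the threads per block must fit within the device's register file.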

This thread was started over here but could not continue as the expert did not have a Jetson Nano in their possession.


Thanks for reporting this issue.

We are checking this internally.
Will share more information with you later.



Please try the following compile command:

$ /usr/local/cuda-10.2/bin/nvcc

We tested this on a Nano with JetPack 4.5.1/JetPack 4.6 and it works without any error.

$ ./a.out
Capture resolution: 1280x720
Rectangle size: 4x120
CUDA analysis area: 1280x720
CUDA threads used: 1920 (93%).
webcamBufs.boxMinMean[0] = 37.



Okay, I realise that this fixes the problem but the L4T MMAPI samples all use

-gencode arch=compute_53,code=sm_53

in their Makefiles. See ./samples/common/algorithm/cuda/Makefile:

GENCODE_SM53 := -gencode arch=compute_53,code=sm_53

NvCudaProc.o : NvCudaProc.cpp
	@echo "Compiling: $<"

The same GENCODE_SM53 flag appears in ./samples/v4l2cuda/Makefile.

Why does Nvidia use this in the Makefiles if it doesn’t work?

Is “-gencode” fundamentally broken for Jetson Nano?


The example .cu file I presented to you is only a cut-down version of the problem I am seeing. The full version of the loop is shown below:

for (x_offset = 0; x_offset < cu.rectWidth; x_offset++) {
    for (y_offset = 0; y_offset < cu.rectHeight; y_offset++) {
        offset = (start_row + y_offset) * cu.fbWidth + start_col + x_offset;
        buf.IavgF[offset] /= buf.Icount[box_offset];
        buf.IdiffF[offset] /= buf.Icount[box_offset];
        buf.IdiffF[offset] += 1;
        buf.IhiF[offset] = buf.IavgF[offset] + (buf.IdiffF[offset] * high_thresh);
        buf.IlowF[offset] = buf.IavgF[offset] - (buf.IdiffF[offset] * low_thresh);
        buf.boxMinMean[box_offset] = buf.IavgF[offset];
        // There is a "too many resources requested for launch" error UNLESS the next 2 lines are commented out.
        buf.boxMaxMean[box_offset] = buf.IavgF[offset];
        buf.boxMaxAvFrameDiff[box_offset] = buf.IdiffF[offset];
    }
}

Please see the updated attached file - all I have done is allocate memory for the additional buffers. Why do I get the “too many resources requested for launch” error? (6.6 KB)

I have only allocated 19,361,280 bytes of device memory and 5760 bytes of unified memory.
Thanks again.

Not sure about your case, but I’ve seen a similar error when using --device-debug.



It should work when compiling a CUDA app with -gencode arch=compute_53,code=sm_53 on Nano.
Our internal team is checking the cause currently.
We will share more information once we get feedback.



Interesting, but I am not using that switch.


We got some feedback from our internal team.
When you specify the GPU architecture, please also take care of the register bound.

Register pressure occurs when there are not enough registers available for a given task. Even though each multiprocessor contains thousands of 32-bit registers (see Features and Technical Specifications of the CUDA C++ Programming Guide), these are partitioned among concurrent threads. To prevent the compiler from allocating too many registers, use the -maxrregcount=N compiler command-line option (see nvcc) or the launch bounds kernel definition qualifier (see Execution Configuration of the CUDA C++ Programming Guide) to control the maximum number of registers allocated per thread.

For example, you should use -maxrregcount=32 for Nano.
This can be calculated via the information from the deviceQuery.

maxrregcount = max registers per block / max threads per block = 32768 / 1024 = 32

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
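An alternative to the global -maxrregcount flag is the per-kernel __launch_bounds__ qualifier mentioned above; a minimal sketch (the kernel name and body here are illustrative, not the thread's actual kernel):

```cuda
// Tell the compiler this kernel may be launched with up to 960 threads
// per block (160 x 6), so it must stay within 32768 / 960 = ~34
// registers per thread, spilling to local memory if necessary.
__global__ void __launch_bounds__(960)
analyzeFrame(float *IavgF, const int *Icount, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        IavgF[i] /= Icount[0];  // same /= pattern as the failing kernel
}
```

Unlike -maxrregcount, which caps every kernel in the compilation unit, __launch_bounds__ constrains only the kernel it is attached to.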

As a result, we can run your app successfully with the following nvcc command:

$ nvcc -gencode arch=compute_53,code=sm_53 -maxrregcount=32 && ./a.out
Capture resolution: 1280x720
Rectangle size: 4x120
CUDA analysis area: 1280x720
CUDA threads used: 1920 (93%).
webcamBufs.boxMinMean[0] = 37.