Weird CUDA problem: changing += to /= in a loop causes a variable not to be set

I have a CUDA kernel that breaks with the error:

too many resources requested for launch

when I run it with “/=” in the loop in that kernel. If I change the “/=” to “+=” then the kernel runs fine.

Please see the attached file for a test case.

test_code2.cu (5.1 KB)

The code tries to launch 160 x 6 x 2 threads (x-threads x y-threads x blocks). If I reduce the x dimension to 128 (but no higher than that), it runs fine.
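For reference, a launch-configuration failure like this is reported asynchronously: the `<<<...>>>` statement itself does not fail, and the error only appears when the error state is checked after the launch. A minimal sketch of how it can be surfaced (kernel name and body are illustrative, not the code from the attached file):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void demoKernel(float *out)
{
    // Trivial body; the real kernel's register usage is what matters.
    out[blockIdx.x * blockDim.x * blockDim.y
        + threadIdx.y * blockDim.x + threadIdx.x] = 1.0f;
}

int main(void)
{
    float *d_out;
    dim3 threads(160, 6);   // 960 threads per block, as in the test case
    dim3 blocks(2);

    cudaMalloc(&d_out, blocks.x * threads.x * threads.y * sizeof(float));
    demoKernel<<<blocks, threads>>>(d_out);

    // "too many resources requested for launch" is returned here, not at
    // the launch site, so always check cudaGetLastError() after a launch.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```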

Compile it with:

/usr/local/cuda-10.2/bin/nvcc -gencode arch=compute_53,code=sm_53 test_code2.cu

Thanks. I’ve been looking at this for 4 days and cannot see why it is not working.

P.S.
This thread was started over here https://forums.developer.nvidia.com/t/weird-cuda-problem-changing-to-in-a-loop-causes-a-variable-not-to-be-set/188871 but could not continue as the expert did not have a Jetson Nano in their possession.

Hi,

Thanks for reporting this issue.

We are checking this internally.
Will share more information with you later.


Hi,

Please try the following compiling command:

$ /usr/local/cuda-10.2/bin/nvcc test_code2.cu

We tested this on Nano with JetPack 4.5.1 and JetPack 4.6, and it works without any error.

$ ./a.out
Capture resolution: 1280x720
Rectangle size: 4x120
CUDA analysis area: 1280x720
CUDA threads used: 1920 (93%).
webcamBufs.boxMinMean[0] = 37.

Thanks.


Okay, I realise that this fixes the problem but the L4T MMAPI samples all use

-gencode arch=compute_53,code=sm_53

in their Makefiles. See ./samples/common/algorithm/cuda/Makefile:

GENCODE_SM53 := -gencode arch=compute_53,code=sm_53
GENCODE_FLAGS := $(GENCODE_SM53) $(GENCODE_SM62) $(GENCODE_SM72) $(GENCODE_SM_PTX)

NvCudaProc.o : NvCudaProc.cpp
	@echo "Compiling: $<"
	$(NVCC) $(ALL_CPPFLAGS) $(GENCODE_FLAGS) -o $@ -c $<

Same in ./samples/v4l2cuda/Makefile:GENCODE_SM53

Why does Nvidia use this in the Makefiles if it doesn’t work?

Is “-gencode” fundamentally broken for Jetson Nano?

Thanks.

The example .cu file I presented to you is only a limited version of the problem I am seeing. The full version of the loop is as seen below:

for(x_offset = 0; x_offset < cu.rectWidth; x_offset++)
{
    for(y_offset = 0; y_offset < cu.rectHeight; y_offset++)
    {
        offset = (start_row + y_offset) * cu.fbWidth + start_col + x_offset;
        buf.IavgF[offset] /= buf.Icount[box_offset];
        buf.IdiffF[offset] /= buf.Icount[box_offset];
        buf.IdiffF[offset] += 1;
        buf.IhiF[offset] = buf.IavgF[offset] + (buf.IdiffF[offset] * high_thresh);
        buf.IlowF[offset] = buf.IavgF[offset] - (buf.IdiffF[offset] * low_thresh);
        buf.boxMinMean[box_offset] = buf.IavgF[offset];

        // There is a "too many resources requested for launch" error UNLESS
        // the next 2 lines are commented out.
        buf.boxMaxMean[box_offset] = buf.IavgF[offset];
        buf.boxMaxAvFrameDiff[box_offset] = buf.IdiffF[offset];
    }
}

Please see the updated attached file - all I have done is allocate memory for the additional buffers. Why do I get the “too many resources requested for launch” error?

tuesday.cu (6.6 KB)

I have only allocated 19,361,280 bytes of device memory and 5760 bytes of unified memory.
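For what it's worth, the resource that this error complains about is usually registers rather than memory, and the compiled kernel's register usage can be inspected at runtime. A diagnostic sketch (kernel name and body are stand-ins, not the code from tuesday.cu):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real kernel (hypothetical name and body).
__global__ void analyseKernel(float *buf)
{
    buf[threadIdx.x] = 0.0f;
}

int main(void)
{
    // Query how many registers the compiled kernel uses per thread.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, analyseKernel);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int threadsPerBlock = 160 * 6;  // launch geometry from the test case
    printf("registers per thread:       %d\n", attr.numRegs);
    printf("registers needed per block: %d (limit %d)\n",
           attr.numRegs * threadsPerBlock, prop.regsPerBlock);
    // Roughly: if numRegs * threadsPerBlock exceeds regsPerBlock, the launch
    // fails with "too many resources requested for launch". (The hardware
    // allocates registers in warp-sized granules, so this is approximate.)
    return 0;
}
```

Alternatively, compiling with `nvcc -Xptxas -v` prints each kernel's register count at build time.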
Thanks again.

Not sure about your case, but I have seen a similar error when using --device-debug.


Hi,

It should work when compiling a CUDA app with -gencode arch=compute_53,code=sm_53 on Nano.
Our internal team is checking the cause currently.
We will share more information once we get feedback.

Thanks.


Interesting, but I am not using that switch.

Hi,

We got some feedback from our internal team.
When you specify the GPU architecture, please also take care of the register bound (the number of registers available per thread).

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#register-pressure

Register pressure occurs when there are not enough registers available for a given task. Even though each multiprocessor contains thousands of 32-bit registers (see Features and Technical Specifications of the CUDA C++ Programming Guide), these are partitioned among concurrent threads. To prevent the compiler from allocating too many registers, use the -maxrregcount=N compiler command-line option (see nvcc) or the launch bounds kernel definition qualifier (see Execution Configuration of the CUDA C++ Programming Guide) to control the maximum number of registers allocated per thread.

For example, you should use -maxrregcount=32 for Nano.
The value can be calculated from the deviceQuery output.

maxrregcount = max registers per block / max threads per block = 32768 / 1024 = 32

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
...
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
...

As a result, we can run your app successfully with the following nvcc command:

$ nvcc -gencode arch=compute_53,code=sm_53 -maxrregcount=32 test.cu && ./a.out
Capture resolution: 1280x720
Rectangle size: 4x120
CUDA analysis area: 1280x720
CUDA threads used: 1920 (93%).
webcamBufs.boxMinMean[0] = 37.
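As the quoted guide notes, the same limit can also be expressed in the source via the launch bounds qualifier, which caps register allocation only for the annotated kernel rather than for the whole compilation unit. A sketch (kernel name is illustrative):

```cuda
// Promise the compiler this kernel may be launched with up to 960 threads
// per block; nvcc then limits register allocation so such a launch fits.
__global__ void __launch_bounds__(960)
analyseKernel(float *buf)
{
    // kernel body as before
    buf[threadIdx.x] = 0.0f;
}
```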

Thanks.


Thank you so much. I would not have worked that out by myself very easily.