I have a CUDA kernel that breaks with the error:
too many resources requested for launch
when the loop in that kernel uses “/=”. If I change the “/=” to “+=”, the kernel runs fine.
Please see the attached file for a test case.
test_code2.cu (5.1 KB)
The code tries to launch 160 x 6 x 2 threads (xthreads x ythreads x blocks). If I reduce the x dimension to 128 (anything above that fails) x 6 x 2 threads, it runs fine.
Compile it with:
/usr/local/cuda-10.2/bin/nvcc -gencode arch=compute_53,code=sm_53 test_code2.cu
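For reference, the launch configuration described above is equivalent to something like the following sketch (the kernel and variable names here are placeholders, not the actual ones in test_code2.cu):

```cuda
// 160 x 6 threads per block, 2 blocks; reducing 160 to 128 makes it work.
dim3 threadsPerBlock(160, 6);
dim3 numBlocks(2);
analyseKernel<<<numBlocks, threadsPerBlock>>>(webcamBufs);  // placeholder names
cudaDeviceSynchronize();
```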
Thanks. I’ve been looking at this for 4 days and cannot see why it is not working.
P.S.
This thread was started over here https://forums.developer.nvidia.com/t/weird-cuda-problem-changing-to-in-a-loop-causes-a-variable-not-to-be-set/188871
but could not continue as the expert did not have a Jetson Nano in their possession.
Hi,
Thanks for reporting this issue.
We are checking this internally.
Will share more information with you later.
Hi,
Please try the following compiling command:
$ /usr/local/cuda-10.2/bin/nvcc test_code2.cu
We tested this on a Nano with JetPack 4.5.1 and JetPack 4.6; it works without any error.
$ ./a.out
Capture resolution: 1280x720
Rectangle size: 4x120
CUDA analysis area: 1280x720
CUDA threads used: 1920 (93%).
webcamBufs.boxMinMean[0] = 37.
Thanks.
Okay, I realise that this fixes the problem but the L4T MMAPI samples all use
-gencode arch=compute_53,code=sm_53
in their Makefiles. See ./samples/common/algorithm/cuda/Makefile:
GENCODE_SM53 := -gencode arch=compute_53,code=sm_53
GENCODE_FLAGS := $(GENCODE_SM53) $(GENCODE_SM62) $(GENCODE_SM72) $(GENCODE_SM_PTX)
NvCudaProc.o : NvCudaProc.cpp
@echo "Compiling: $<"
$(NVCC) $(ALL_CPPFLAGS) $(GENCODE_FLAGS) -o $@ -c $<
The same appears in ./samples/v4l2cuda/Makefile: GENCODE_SM53
Why does Nvidia use this in the Makefiles if it doesn’t work?
Is “-gencode” fundamentally broken for Jetson Nano?
Thanks.
The example .cu file I presented to you is only a limited version of the problem I am seeing. The full version of the loop is as seen below:
for (x_offset = 0; x_offset < cu.rectWidth; x_offset++)
{
    for (y_offset = 0; y_offset < cu.rectHeight; y_offset++)
    {
        offset = (start_row + y_offset) * cu.fbWidth + start_col + x_offset;
        buf.IavgF[offset]  /= buf.Icount[box_offset];
        buf.IdiffF[offset] /= buf.Icount[box_offset];
        buf.IdiffF[offset] += 1;
        buf.IhiF[offset]  = buf.IavgF[offset] + (buf.IdiffF[offset] * high_thresh);
        buf.IlowF[offset] = buf.IavgF[offset] - (buf.IdiffF[offset] * low_thresh);
        buf.boxMinMean[box_offset] = buf.IavgF[offset];
        // There is a "too many resources requested for launch" error UNLESS the next 2 lines are commented out.
        buf.boxMaxMean[box_offset] = buf.IavgF[offset];
        buf.boxMaxAvFrameDiff[box_offset] = buf.IdiffF[offset];
    }
}
Please see the updated attached file - all I have done is allocate memory for the additional buffers. Why do I get the “too many resources requested for launch” error?
tuesday.cu (6.6 KB)
I have only allocated 19,361,280 bytes of device memory and 5760 bytes of unified memory.
Thanks again.
Not sure about your case, but I've seen a similar error when using --device-debug.
Hi,
It should work when compiling a CUDA app with -gencode arch=compute_53,code=sm_53 on Nano.
Our internal team is checking the cause currently.
Will share more information once we get any feedback.
Thanks.
Interesting, but I am not using that switch.
Hi,
We got some feedback from our internal team.
When you specify the GPU architecture, please also take care of the register bound.
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#register-pressure
Register pressure occurs when there are not enough registers available for a given task. Even though each multiprocessor contains thousands of 32-bit registers (see Features and Technical Specifications of the CUDA C++ Programming Guide), these are partitioned among concurrent threads. To prevent the compiler from allocating too many registers, use the -maxrregcount=N compiler command-line option (see nvcc) or the launch bounds kernel definition qualifier (see Execution Configuration of the CUDA C++ Programming Guide) to control the maximum number of registers allocated per thread.
For example, you should use -maxrregcount=32 for Nano.
This can be calculated via the information from the deviceQuery.
maxrregcount = max registers per block / max threads per block = 32768 / 1024 = 32
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X1"
...
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
...
As a result, we can run your app successfully with the following nvcc command:
$ nvcc -gencode arch=compute_53,code=sm_53 -maxrregcount=32 test.cu && ./a.out
Capture resolution: 1280x720
Rectangle size: 4x120
CUDA analysis area: 1280x720
CUDA threads used: 1920 (93%).
webcamBufs.boxMinMean[0] = 37.
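As an alternative to the -maxrregcount flag, the launch-bounds qualifier mentioned in the quoted Best Practices passage can be applied to the kernel itself. This is a sketch only — the kernel name and parameter are placeholders, since the actual signature lives in test_code2.cu:

```cuda
// Tells the compiler this kernel may be launched with up to 960 threads
// per block (160 x 6), so it must keep per-thread register usage low
// enough that a full 960-thread block fits in the 32768 registers per block.
__global__ void __launch_bounds__(960) analyseKernel(WebcamBufs buf)
{
    // ... kernel body as in test_code2.cu ...
}
```

Unlike the global -maxrregcount switch, this limits registers only for the annotated kernel.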
Thanks.
Thank you so much. I would not have worked that out by myself very easily.