Hi,
We got some feedback from our internal team.
When you specify the GPU architecture, please also set a bound on the number of registers per thread.
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#register-pressure
Register pressure occurs when there are not enough registers available for a given task. Even though each multiprocessor contains thousands of 32-bit registers (see Features and Technical Specifications of the CUDA C++ Programming Guide), these are partitioned among concurrent threads. To prevent the compiler from allocating too many registers, use the -maxrregcount=N compiler command-line option (see nvcc) or the launch bounds kernel definition qualifier (see Execution Configuration of the CUDA C++ Programming Guide) to control the maximum number of registers to allocate per thread.
For example, you should use -maxrregcount=32 for Nano.
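The guide quoted above also mentions the launch bounds qualifier as a per-kernel alternative to the command-line flag. Here is a minimal sketch of that approach. The kernel `scaleKernel`, the helpers `blocksFor` and `launchScale`, and the constant `kMaxThreadsPerBlock` are hypothetical names for illustration only and are not taken from your test.cu.

```cuda
#include <cuda_runtime.h>

// Assumption for this sketch: blocks never exceed 1024 threads, which matches
// "Maximum number of threads per block" reported by deviceQuery on Nano.
constexpr int kMaxThreadsPerBlock = 1024;

// Hypothetical kernel. __launch_bounds__ tells the compiler the largest block
// size this kernel will be launched with, so it can limit register usage
// per thread accordingly.
__global__ void __launch_bounds__(kMaxThreadsPerBlock)
scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Number of blocks needed to cover n elements (rounds up).
constexpr int blocksFor(int n)
{
    return (n + kMaxThreadsPerBlock - 1) / kMaxThreadsPerBlock;
}

// Launches the kernel on device memory and reports any launch or execution error.
cudaError_t launchScale(float *deviceData, float factor, int n)
{
    scaleKernel<<<blocksFor(n), kMaxThreadsPerBlock>>>(deviceData, factor, n);
    cudaError_t err = cudaGetLastError();   // launch-time errors
    if (err != cudaSuccess) {
        return err;
    }
    return cudaDeviceSynchronize();         // execution-time errors
}
```

With the qualifier in place, the limit applies only to the annotated kernel, whereas -maxrregcount applies to every kernel in the file.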
This value can be calculated from the deviceQuery output:
maxrregcount = (registers available per block) / (maximum threads per block) = 32768 / 1024 = 32
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X1"
...
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
...
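The same calculation can also be done at runtime instead of reading the deviceQuery output by hand. The following is only a sketch: `maxRegsPerThread` and `suggestMaxrregcount` are hypothetical helper names, and it assumes the `regsPerBlock` and `maxThreadsPerBlock` fields of `cudaDeviceProp` correspond to the two deviceQuery lines used above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Registers available per block divided by the maximum threads per block.
constexpr int maxRegsPerThread(int regsPerBlock, int maxThreadsPerBlock)
{
    return regsPerBlock / maxThreadsPerBlock;
}

// Queries the given device and prints a suggested -maxrregcount value.
// Returns the value, or -1 if the device properties could not be read.
int suggestMaxrregcount(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }
    int n = maxRegsPerThread(prop.regsPerBlock, prop.maxThreadsPerBlock);
    std::printf("%s: -maxrregcount=%d\n", prop.name, n);
    return n;
}
```

Calling `suggestMaxrregcount(0)` from `main` should reproduce the hand calculation for device 0, provided the runtime reports the same values as deviceQuery.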
As a result, we can run your app successfully with the following nvcc command:
$ nvcc -gencode arch=compute_53,code=sm_53 -maxrregcount=32 test.cu && ./a.out
Capture resolution: 1280x720
Rectangle size: 4x120
CUDA analysis area: 1280x720
CUDA threads used: 1920 (93%).
webcamBufs.boxMinMean[0] = 37.
Thanks.