Weird CUDA problem: changing += to /= in a loop causes a variable not to be set


We got some feedback from our internal team.
When you specify the GPU architecture, please also set the per-thread register limit.

Register pressure occurs when there are not enough registers available for a given task. Even though each multiprocessor contains thousands of 32-bit registers (see Features and Technical Specifications of the CUDA C++ Programming Guide), these are partitioned among concurrent threads. To prevent the compiler from allocating too many registers, use the -maxrregcount=N compiler command-line option (see nvcc) or the launch bounds kernel definition qualifier (see Execution Configuration of the CUDA C++ Programming Guide) to control the maximum number of registers allocated per thread.

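For example, you should use -maxrregcount=32 for Nano.
This value can be calculated from the deviceQuery output: divide the registers available per block by the maximum number of threads per block, so that a full 1024-thread block stays within the 32768 registers available per block.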

maxrregcount = (registers per block) / (max threads per block) = 32768 / 1024 = 32

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
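
If it helps, the same arithmetic can be written as a small host-side C++ sketch. The function name max_regs_per_thread is only for illustration (it is not a CUDA API). Its two inputs are the "Total number of registers available per block" and "Maximum number of threads per block" values from the deviceQuery output above. In a real program they could be read from the regsPerBlock and maxThreadsPerBlock fields of cudaDeviceProp instead of being hard-coded.

```cpp
// Illustrative helper (hypothetical name, not part of CUDA):
// largest per-thread register count that still lets a block of
// `threads_per_block` threads fit in `regs_per_block` registers.
constexpr int max_regs_per_thread(int regs_per_block, int threads_per_block) {
    // Integer division rounds down, so the block never exceeds the budget.
    // A non-positive thread count yields 0 rather than dividing by zero.
    return threads_per_block > 0 ? regs_per_block / threads_per_block : 0;
}

// Values copied from the Tegra X1 deviceQuery output in this post.
constexpr int kRegsPerBlock = 32768;
constexpr int kMaxThreadsPerBlock = 1024;

// Compile-time check of the figure used for -maxrregcount on Nano.
static_assert(max_regs_per_thread(kRegsPerBlock, kMaxThreadsPerBlock) == 32,
              "expected 32 registers per thread for a 1024-thread block");
```

With smaller blocks the budget per thread grows accordingly, for example 64 registers for 512 threads per block.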

As a result, we can run your app successfully with the following nvcc command:

$ nvcc -gencode arch=compute_53,code=sm_53 -maxrregcount=32 && ./a.out
Capture resolution: 1280x720
Rectangle size: 4x120
CUDA analysis area: 1280x720
CUDA threads used: 1920 (93%).
webcamBufs.boxMinMean[0] = 37.


