Weird CUDA problem: changing += to /= in a loop causes a variable not to be set

AastaLLL · September 24, 2021, 4:17am

Hi,

We got some feedback from our internal team.
When you specify the GPU architecture, please also set the per-thread register limit.

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#register-pressure

Register pressure occurs when there are not enough registers available for a given task. Even though each multiprocessor contains thousands of 32-bit registers (see Features and Technical Specifications of the CUDA C++ Programming Guide), these are partitioned among concurrent threads. To prevent the compiler from allocating too many registers, use the -maxrregcount=N compiler command-line option (see nvcc) or the launch bounds kernel definition qualifier (see Execution Configuration of the CUDA C++ Programming Guide) to control the maximum number of registers to be allocated per thread.

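For example, you should use -maxrregcount=32 for Nano.
This value can be calculated from the deviceQuery output below, so that a block with the maximum number of threads still fits in the registers available per block:

maxrregcount = registers per block / max threads per block = 32768 / 1024 = 32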

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
...
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
...

With this limit, we can run your app successfully using the following nvcc command:

$ nvcc -gencode arch=compute_53,code=sm_53 -maxrregcount=32 test.cu && ./a.out
Capture resolution: 1280x720
Rectangle size: 4x120
CUDA analysis area: 1280x720
CUDA threads used: 1920 (93%).
webcamBufs.boxMinMean[0] = 37.

Thanks.