I have an application that makes heavy use of the GPU via the OpenCV4Tegra library. The application runs anywhere from 10 minutes to two hours before the GPU hangs. When running X windows, the screen locks up; when ssh’ed into the TX-1, the application simply hangs. The application is straightforward: in a loop, it reads two image files from a mounted USB stick, performs some image processing (image registration), then writes the results to a text file.
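For context, the body of the loop is roughly equivalent to the sketch below; the file names and the matchTemplate-based registration step are placeholders rather than my exact code, but the loop does call cv::gpu::minMaxLoc, which comes up again further down:

#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    for (;;) {
        // Read the two input images from the USB stick (paths are examples).
        cv::Mat scene = cv::imread("/media/usb/scene.png", CV_LOAD_IMAGE_GRAYSCALE);
        cv::Mat patch = cv::imread("/media/usb/patch.png", CV_LOAD_IMAGE_GRAYSCALE);

        // Upload to the GPU and correlate the patch against the scene.
        cv::gpu::GpuMat d_scene(scene), d_patch(patch), d_corr;
        cv::gpu::matchTemplate(d_scene, d_patch, d_corr, CV_TM_CCORR_NORMED);

        // The correlation peak gives the registration offset.
        double minVal, maxVal;
        cv::Point minLoc, maxLoc;
        cv::gpu::minMaxLoc(d_corr, &minVal, &maxVal, &minLoc, &maxLoc);

        // Append maxLoc to the results text file (omitted here).
    }
    return 0;
}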
I would like some guidance on debugging the issue.
I’ve tried a few things and so far have not been able to determine the root cause.
I have attached the output of nvidia-bug-report-tegra.sh script.
When the application hangs, I am able to ssh into the TX-1 and get a backtrace using gdb. I can then kill the application, after which X windows continues running as if nothing had happened.
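For reference, I capture these backtraces by attaching to the hung process from the ssh session, along the lines of (the PID here is just an example):

sudo gdb -p 7171 -batch -ex "thread apply all bt"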
Before I kill the application, the backtrace shows the thread is blocked in nanosleep(), and one of two CUDA driver calls is on the stack: cuMemFree_v2() or cuCtxSynchronize() (reached through libcudart, so presumably via cudaFree() or cudaDeviceSynchronize()).
Here are the two types of backtraces I’ve seen:
Backtrace #1
Thread 44 (Thread 0x7f46ffe4b0 (LWP 7171)):
#0 0x0000007f76e28d78 in nanosleep () at ../sysdeps/unix/syscall-template.S:86
#1 0x0000007f76e4e308 in usleep (useconds=) at ../sysdeps/posix/usleep.c:32
#2 0x0000007f50dc44f4 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#3 0x0000007f50cdc600 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#4 0x0000007f50aa6d58 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#5 0x0000007f50d594c4 in cuMemFree_v2 () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#6 0x0000007f76fc5f2c in ?? () from /usr/local/cuda-8.0/targets/aarch64-linux/lib/libcudart.so.8.0
#7 0x0000000000d77a90 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further
Backtrace #2
Thread 2 (Thread 0x7eb9f424b0 (LWP 7288)):
#0 0x0000007f76e28d74 in nanosleep () at ../sysdeps/unix/syscall-template.S:86
#1 0x0000007f76e4e308 in usleep (useconds=) at ../sysdeps/posix/usleep.c:32
#2 0x0000007f50dc44f4 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#3 0x0000007f50cdc600 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#4 0x0000007f50cdcea0 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#5 0x0000007f50aa1a24 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#6 0x0000007f50d56eac in cuCtxSynchronize () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#7 0x0000007f76fc1b68 in ?? () from /usr/local/cuda-8.0/targets/aarch64-linux/lib/libcudart.so.8.0
#8 0x0000000000d77a90 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further
I’ve run “cuda-memcheck --tool memcheck” to confirm there are no memory-access issues.
I’ve run “cuda-memcheck --tool racecheck” and the tool reports there is a race in the “void cv::gpu::minMaxLoc::kernel_pass_1” and “void cv::gpu::minMaxLoc::kernel_pass_2” functions. [See the cuda-racecheck-log_org.txt attachment].
When I change the application to use the CPU version of minMaxLoc (cv::minMaxLoc), “cuda-memcheck --tool racecheck” no longer reports a race, yet the application still hangs, so the race is not the cause of my issue.
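For clarity, the CPU fallback amounts to downloading the intermediate result (d_corr in the sketch above) to host memory and calling the CPU routine:

cv::Mat h_corr;
d_corr.download(h_corr);  // copy the GpuMat back to host memory

double minVal, maxVal;
cv::Point minLoc, maxLoc;
cv::minMaxLoc(h_corr, &minVal, &maxVal, &minLoc, &maxLoc);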
After I kill the application, dmesg shows the following message:
gk20a gpu.0: __locked_fifo_preempt: preempt TSG 0 timeout
This message is the first in a string of informative messages that I’d like some help interpreting.
All of the messages are shown in the nvidia-bug-report-tegra.log file.
I’d really like to be able to use cuda-gdb to get a device-side backtrace, but cuda-gdb does not run on the TX-1 I am using, presumably because the single integrated GPU is also driving the display. It reports:
fatal: All CUDA devices are used for display and cannot be used while debugging. (error code = CUDBG_ERROR_ALL_DEVICES_WATCHDOGGED(0x18))
nvidia-bug-report-tegra.log (1.44 MB)
cuda-racecheck-log_org.txt (339 KB)