GPU Hangs When Using OpenCV on the Jetson TX-1

I have an application that makes heavy use of the GPU through the OpenCV4Tegra library. The application runs anywhere from 10 minutes to two hours before the GPU hangs. When running X Windows, the screen locks up. When ssh’ed into the TX-1, the application simply hangs. The application is pretty straightforward: in a loop, it reads two image files from a mounted USB stick, performs some image processing (image registration), then writes the results to a text file.

I would like some guidance on debugging the issue.

I’ve tried a few things and so far have not been able to determine the root cause.

I have attached the output of nvidia-bug-report-tegra.sh script.

When the application hangs, I am able to ssh into the TX-1 and get a backtrace using gdb. I can then kill the application, which allows X Windows to continue running as if nothing happened.

Before I kill the application, the backtrace shows the CPU is in nanosleep() and one of two CUDA calls has been made: cuMemFree_v2() or cuCtxSynchronize().

Here are the two types of backtraces I’ve seen:

Backtrace #1

Thread 44 (Thread 0x7f46ffe4b0 (LWP 7171)):
#0 0x0000007f76e28d78 in nanosleep () at …/sysdeps/unix/syscall-template.S:86
#1 0x0000007f76e4e308 in usleep (useconds=) at …/sysdeps/posix/usleep.c:32
#2 0x0000007f50dc44f4 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#3 0x0000007f50cdc600 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#4 0x0000007f50aa6d58 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#5 0x0000007f50d594c4 in cuMemFree_v2 () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#6 0x0000007f76fc5f2c in ?? () from /usr/local/cuda-8.0/targets/aarch64-linux/lib/libcudart.so.8.0
#7 0x0000000000d77a90 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further

Backtrace #2

Thread 2 (Thread 0x7eb9f424b0 (LWP 7288)):
#0 0x0000007f76e28d74 in nanosleep () at …/sysdeps/unix/syscall-template.S:86
#1 0x0000007f76e4e308 in usleep (useconds=) at …/sysdeps/posix/usleep.c:32
#2 0x0000007f50dc44f4 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#3 0x0000007f50cdc600 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#4 0x0000007f50cdcea0 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#5 0x0000007f50aa1a24 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#6 0x0000007f50d56eac in cuCtxSynchronize () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#7 0x0000007f76fc1b68 in ?? () from /usr/local/cuda-8.0/targets/aarch64-linux/lib/libcudart.so.8.0
#8 0x0000000000d77a90 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further

I’ve run “cuda-memcheck --tool memcheck” to confirm there are no memory access issues.

I’ve run “cuda-memcheck --tool racecheck” and the tool reports there is a race in the “void cv::gpu::minMaxLoc::kernel_pass_1” and “void cv::gpu::minMaxLoc::kernel_pass_2” functions. [See the cuda-racecheck-log_org.txt attachment].

When I change the application to use the CPU version of minMaxLoc (cv::minMaxLoc), the “cuda-memcheck --tool racecheck” tool no longer reports that there is a race condition. So that’s not the cause of my issue.

After I kill the application, dmesg shows the following message:

gk20a gpu.0: __locked_fifo_preempt: preempt TSG 0 timeout

as the start of a string of informative messages I’d like some help understanding.

All of the messages are shown in the nvidia-bug-report-tegra.log file.

I’d really like to be able to use cuda-gdb to get a backtrace, but cuda-gdb does not run on the TX-1 I am using. It reports:

fatal: All CUDA devices are used for display and cannot be used while debugging. (error code = CUDBG_ERROR_ALL_DEVICES_WATCHDOGGED(0x18)
nvidia-bug-report-tegra.log (1.44 MB)
cuda-racecheck-log_org.txt (339 KB)

This is probably unrelated, but I notice your file system is almost full: only 745MB is left. I am wondering whether your program needs temp files or some other temporary increase in disk usage. Even though the main reads (and, it seems, writes) go through the USB stick, it would not be unusual for “/tmp” or “/var” to consume space only while the program is alive and running. I could see the possibility of some sort of race condition if temp file space was locked.

Perhaps you could run and watch this as the program hangs (in theory this should hang at the same time your program hangs so you wouldn’t need to sit and watch it…just put it on ssh):

watch df -h /

Also, for CPU/GPU usage and frequency while your program is running, you could use ./tegrastats.

Hi linuxdev,

Thanks for the suggestion.

watch df -h did not show a significant drop in the available disk space or any significant usage on /tmp or /var.

I ran the following command in a separate ssh session:
watch df /

I had to remove the -h option to get the resolution needed to see that the available space was actually decreasing as the program ran.

Start:
Every 2.0s: df / Wed Feb 1 13:25:46 2017
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mmcblk0p1 14318640 12799360 768896 95% /

Some time in the middle:
Every 2.0s: df / Wed Feb 1 13:26:30 2017
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mmcblk0p1 14318640 12799112 769144 95% /

Some time in the middle:
Every 2.0s: df / Wed Feb 1 13:30:14 2017
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mmcblk0p1 14318640 12799160 769096 95% /

Some time in the middle:
Every 2.0s: df / Wed Feb 1 13:37:46 2017
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mmcblk0p1 14318640 12799260 768996 95% /
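
In case it helps anyone repeat this, the four samples above can also be compared with a few lines of Python instead of by eye. This is only a sketch: the function names are made up for illustration, and it assumes the six-column df layout shown above.

```python
def available_kb(df_line):
    """Return the 'Available' column (1K blocks) from one df data line."""
    # Columns: Filesystem, 1K-blocks, Used, Available, Use%, Mounted on
    return int(df_line.split()[3])


def available_deltas(samples):
    """Change in available space relative to the first sample, per timestamp."""
    first = available_kb(samples[0][1])
    return [(stamp, available_kb(line) - first) for stamp, line in samples]


# The four df readings quoted above, keyed by the time shown by watch.
SAMPLES = [
    ("13:25:46", "/dev/mmcblk0p1 14318640 12799360 768896 95% /"),
    ("13:26:30", "/dev/mmcblk0p1 14318640 12799112 769144 95% /"),
    ("13:30:14", "/dev/mmcblk0p1 14318640 12799160 769096 95% /"),
    ("13:37:46", "/dev/mmcblk0p1 14318640 12799260 768996 95% /"),
]

for stamp, delta in available_deltas(SAMPLES):
    print(f"{stamp}  change in available space: {delta:+d} KB")
```

For these samples the change relative to the first reading is 0, +248, +200 and +100 KB, so the available space moves by only a few hundred KB over the twelve minutes.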

Hi chijen,

I forgot to mention that I have already tried tegrastats to gain insight into this issue.

About the only thing tegrastats tells me is that GR3D (the GPU?) is pegged at 99% during the hang.

WHEN APP IS NOT RUNNING:
ubuntu@tegra-ubuntu:~/build$ ~/tegrastats
RAM 2093/3994MB (lfb 151x4MB) cpu [0%,0%,0%,0%]@1734 GR3D 0%@76 EDP limit 0
RAM 2093/3994MB (lfb 151x4MB) cpu [6%,4%,17%,51%]@102 GR3D 0%@76 EDP limit 0
RAM 2093/3994MB (lfb 151x4MB) cpu [1%,5%,7%,0%]@102 GR3D 0%@76 EDP limit 0
RAM 2093/3994MB (lfb 151x4MB) cpu [6%,4%,0%,0%]@102 GR3D 0%@76 EDP limit 0
RAM 2093/3994MB (lfb 151x4MB) cpu [11%,3%,0%,0%]@102 GR3D 0%@76 EDP limit 0
RAM 2093/3994MB (lfb 151x4MB) cpu [6%,4%,0%,0%]@102 GR3D 0%@76 EDP limit 0
RAM 2078/3994MB (lfb 153x4MB) cpu [11%,2%,1%,0%]@102 GR3D 0%@76 EDP limit 0
RAM 2077/3994MB (lfb 153x4MB) cpu [9%,1%,0%,0%]@1224 GR3D 0%@76 EDP limit 0
RAM 2077/3994MB (lfb 153x4MB) cpu [5%,0%,0%,0%]@102 GR3D 0%@76 EDP limit 0
RAM 2077/3994MB (lfb 153x4MB) cpu [12%,2%,1%,0%]@102 GR3D 0%@76 EDP limit 0
RAM 2077/3994MB (lfb 153x4MB) cpu [10%,2%,0%,0%]@102 GR3D 0%@76 EDP limit 0
RAM 2077/3994MB (lfb 153x4MB) cpu [9%,4%,1%,0%]@102 GR3D 0%@76 EDP limit 0
RAM 2077/3994MB (lfb 153x4MB) cpu [7%,1%,0%,0%]@102 GR3D 0%@76 EDP limit 0
RAM 2077/3994MB (lfb 153x4MB) cpu [8%,2%,0%,0%]@102 GR3D 0%@76 EDP limit 0
RAM 2077/3994MB (lfb 153x4MB) cpu [5%,5%,0%,0%]@102 GR3D 0%@76 EDP limit 0
RAM 2077/3994MB (lfb 153x4MB) cpu [9%,5%,0%,0%]@102 GR3D 0%@76 EDP limit 0

WHEN APP IS RUNNING NORMALLY:
ubuntu@tegra-ubuntu:~/build$ ~/tegrastats
RAM 2800/3994MB (lfb 47x4MB) cpu [0%,0%,0%,0%]@1734 GR3D 75%@614 EDP limit 0
RAM 2800/3994MB (lfb 47x4MB) cpu [31%,19%,38%,40%]@1734 GR3D 65%@614 EDP limit 0
RAM 2800/3994MB (lfb 47x4MB) cpu [11%,19%,52%,38%]@1734 GR3D 64%@460 EDP limit 0
RAM 2800/3994MB (lfb 47x4MB) cpu [26%,12%,46%,38%]@1555 GR3D 64%@460 EDP limit 0
RAM 2800/3994MB (lfb 47x4MB) cpu [25%,23%,34%,35%]@1734 GR3D 61%@384 EDP limit 0
RAM 2800/3994MB (lfb 47x4MB) cpu [26%,15%,49%,41%]@1734 GR3D 60%@460 EDP limit 0
RAM 2800/3994MB (lfb 47x4MB) cpu [24%,10%,53%,43%]@1632 GR3D 52%@691 EDP limit 0
RAM 2800/3994MB (lfb 47x4MB) cpu [32%,10%,39%,34%]@1734 GR3D 77%@537 EDP limit 0
RAM 2796/3994MB (lfb 47x4MB) cpu [17%,10%,55%,41%]@1734 GR3D 0%@614 EDP limit 0
RAM 2800/3994MB (lfb 47x4MB) cpu [23%,28%,43%,35%]@1632 GR3D 72%@460 EDP limit 0
RAM 2799/3994MB (lfb 47x4MB) cpu [11%,12%,61%,40%]@921 GR3D 47%@460 EDP limit 0
RAM 2800/3994MB (lfb 47x4MB) cpu [20%,21%,46%,40%]@1428 GR3D 53%@614 EDP limit 0
RAM 2800/3994MB (lfb 47x4MB) cpu [25%,17%,45%,37%]@1132 GR3D 76%@537 EDP limit 0
RAM 2800/3994MB (lfb 47x4MB) cpu [40%,20%,31%,39%]@1734 GR3D 63%@460 EDP limit 0
RAM 2800/3994MB (lfb 47x4MB) cpu [28%,15%,46%,36%]@1632 GR3D 63%@537 EDP limit 0
RAM 2791/3994MB (lfb 47x4MB) cpu [36%,27%,25%,41%]@1734 GR3D 71%@614 EDP limit 0

WHEN APP HANGS (Notice the GR3D is 99% loaded)
ubuntu@tegra-ubuntu:~/build$ ~/tegrastats
RAM 2983/3994MB (lfb 17x4MB) cpu [0%,0%,0%,0%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [3%,0%,48%,9%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [3%,48%,4%,7%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [2%,47%,0%,10%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [1%,54%,0%,3%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [1%,50%,0%,8%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [1%,48%,0%,5%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [2%,51%,1%,7%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [0%,47%,0%,2%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [2%,50%,0%,4%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [0%,52%,0%,3%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [2%,55%,1%,13%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [1%,50%,0%,7%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [0%,48%,0%,12%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [2%,44%,1%,12%]@1036 GR3D 99%@998 EDP limit 0
RAM 2983/3994MB (lfb 17x4MB) cpu [1%,51%,1%,7%]@1036 GR3D 99%@998 EDP limit 0
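
The three captures above could be told apart automatically. Here is a minimal sketch in Python, assuming the tegrastats line format shown here (other L4T releases may print it differently). The function names, the 99% threshold and the five-sample window are my own choices, not anything tegrastats provides.

```python
import re

# Matches e.g. "GR3D 99%@998" -> load percent and clock in MHz.
GR3D_RE = re.compile(r"GR3D (\d+)%@(\d+)")


def gr3d_load(line):
    """Return (load_percent, clock_mhz) from one tegrastats line, or None."""
    match = GR3D_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def looks_hung(lines, threshold=99, min_samples=5):
    """True if the last min_samples parsed lines all show load >= threshold."""
    loads = [parsed[0] for parsed in map(gr3d_load, lines) if parsed is not None]
    if len(loads) < min_samples:
        return False
    return all(load >= threshold for load in loads[-min_samples:])


hang_lines = [
    "RAM 2983/3994MB (lfb 17x4MB) cpu [3%,0%,48%,9%]@1036 GR3D 99%@998 EDP limit 0",
] * 5
print(looks_hung(hang_lines))
```

Piping tegrastats through something like this over ssh would at least timestamp the moment the GPU load stops fluctuating.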

I was also able to finally attach cuda-gdb to the process.

The cuda-gdb backtrace during the hang shows:
Attaching to process running on watchdogged GPU is not possible.
Please repeat the attempt in console mode or restart the process with CUDA_VISIBLE_DEVICES environment variable set.
A program is being debugged already. Kill it? (y or n) n
Program not killed.
(cuda-gdb) bt
#0 0x0000007f76b8ec14 in nanosleep () from /lib/aarch64-linux-gnu/libc.so.6
#1 0x0000007f76bb4088 in usleep () from /lib/aarch64-linux-gnu/libc.so.6
#2 0x0000007f6228f4f4 in cuVDPAUCtxCreate () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#3 0x0000007f621a7600 in cudbgApiDetach () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#4 0x0000007f61f71d58 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#5 0x0000007f622244c4 in cuMemFree_v2 () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#6 0x0000007f7d64ff2c in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#7 0x00000000004700e0 in ?? ()
(cuda-gdb)

Whereas the gdb output during the hang shows:
0x0000007f6269bbb4 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
(gdb) bt
#0 0x0000007f6269bbb4 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#1 0x0000007f61f8ac6c in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#2 0x0000007f6228e560 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#3 0x0000007f6228f0e8 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#4 0x0000007f6228f504 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#5 0x0000007f621a7600 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#6 0x0000007f61f71d58 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#7 0x0000007f622244c4 in cuMemFree_v2 () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#8 0x0000007f7d64ff2c in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#9 0x00000000004700e0 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further

I’ve attached the full output in case it is helpful.

cuda-gdb_output.txt (22 KB)
gdb_output.txt (2.03 KB)

If it is sleeping, perhaps it is waiting for data. If the wait is polling instead of blocking, the CPU would peg.
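
To illustrate the difference being described, here is a toy Python sketch. It has nothing to do with the CUDA driver's internals, which I can't see: one waiter rechecks a flag with a short sleep in between, the other blocks until it is signalled. All names are made up for the example.

```python
import threading
import time


def wait_by_polling(flag, interval=0.001):
    """Recheck the flag in a loop, sleeping between checks; return the check count."""
    checks = 0
    while not flag.is_set():
        checks += 1
        time.sleep(interval)  # without this sleep the loop would spin a core flat out
    return checks


def wait_by_blocking(flag, timeout=None):
    """Block inside the OS until the flag is set; uses no CPU while waiting."""
    return flag.wait(timeout)


# Simulate "data arrives after 50 ms" for each style of wait.
polled = threading.Event()
threading.Timer(0.05, polled.set).start()
print("polling checks before the flag was set:", wait_by_polling(polled))

blocked = threading.Event()
threading.Timer(0.05, blocked.set).start()
print("blocking wait returned:", wait_by_blocking(blocked))
```

A sleep-and-recheck loop like the first one is what a nanosleep/usleep frame under a driver call would look like from the outside, but that is only a guess about the backtraces above.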

llessur,
I assume you are using JetPack 2.3.1 for your work, correct?

The OpenCV4Tegra included in JetPack 2.3.1 is based on OpenCV 2.4.13-17. One recommended approach is to build and use version 3.1, which has been upstreamed to the public OpenCV 3.1 release, as described here:

http://docs.opencv.org/master/d6/d15/tutorial_building_tegra_cuda.html

I went through the effort of building OpenCV 3.1 as described here:
http://docs.opencv.org/master/d6/d15/tutorial_building_tegra_cuda.html

The only thing I did differently was to run cmake with the following option:
-DCMAKE_INSTALL_PREFIX=/home/ubuntu/installed
so I wouldn’t install OpenCV 3.1 on top of OpenCV4Tegra.

I updated my code to be compatible with OpenCV 3.1 since some of the include file locations and the API changed. Time consuming but fairly straightforward…

When I run the resulting application, the GPU still hangs.

The backtrace showed:
(cuda-gdb) bt
#0 0x0000007fa3169c18 in nanosleep () from /lib/aarch64-linux-gnu/libc.so.6
#1 0x0000007fa318f088 in usleep () from /lib/aarch64-linux-gnu/libc.so.6
#2 0x0000007f9408a4f4 in cuVDPAUCtxCreate () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#3 0x0000007f93fa2600 in cudbgApiDetach () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#4 0x0000007f93d6cd58 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#5 0x0000007f9401f4c4 in cuMemFree_v2 () from /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#6 0x0000007fa5b41f2c in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#7 0x00000000005861f0 in ?? ()

I double-checked that the OpenCV 3.1 shared libraries were being used by running lsof -p on the process.

The output showed:
ubuntu@tegra-ubuntu:~/Debug$ lsof -p 373
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
unit_test 373 ubuntu cwd DIR 179,1 4096 442482 /home/ubuntu/gpu_ReddyImageRegistration/Debug
unit_test 373 ubuntu rtd DIR 179,1 4096 2 /
unit_test 373 ubuntu txt REG 179,1 960144 405344 /home/ubuntu/gpu_ReddyImageRegistration/Debug/unit_test_for_nvidia
unit_test 373 ubuntu mem REG 0,9 6121 anon_inode:dmabuf (stat: No such file or directory)
unit_test 373 ubuntu mem REG 179,1 47776 269202 /usr/lib/aarch64-linux-gnu/tegra/libnvos.so
unit_test 373 ubuntu mem REG 179,1 160816 269159 /usr/lib/aarch64-linux-gnu/tegra/libnvrm.so
unit_test 373 ubuntu mem REG 179,1 110192 269192 /usr/lib/aarch64-linux-gnu/tegra/libnvrm_gpu.so
unit_test 373 ubuntu mem REG 179,1 15122216 269212 /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1
unit_test 373 ubuntu mem REG 179,1 52152 156075 /usr/lib/aarch64-linux-gnu/libtbbmalloc.so.2
unit_test 373 ubuntu mem REG 179,1 53976 138324 /usr/lib/aarch64-linux-gnu/libjbig.so.0
unit_test 373 ubuntu mem REG 179,1 116816 787002 /lib/aarch64-linux-gnu/liblzma.so.5.0.0
unit_test 373 ubuntu mem REG 179,1 92400 787163 /lib/aarch64-linux-gnu/libz.so.1.2.8
unit_test 373 ubuntu mem REG 179,1 187224 156076 /usr/lib/aarch64-linux-gnu/libtbb.so.2
unit_test 373 ubuntu mem REG 179,1 288760 139985 /usr/lib/aarch64-linux-gnu/libjasper.so.1.0.0
unit_test 373 ubuntu mem REG 179,1 422136 139841 /usr/lib/aarch64-linux-gnu/libtiff.so.5.2.4
unit_test 373 ubuntu mem REG 179,1 125232 786964 /lib/aarch64-linux-gnu/libpng12.so.0.54.0
unit_test 373 ubuntu mem REG 179,1 309472 138772 /usr/lib/aarch64-linux-gnu/libwebp.so.5.0.4
unit_test 373 ubuntu mem REG 179,1 223848 139505 /usr/lib/aarch64-linux-gnu/libjpeg.so.8.0.2
unit_test 373 ubuntu mem REG 179,1 157883488 428941 /usr/local/cuda-8.0/targets/aarch64-linux/lib/libcufft.so.8.0.34
unit_test 373 ubuntu mem REG 179,1 76946120 429004 /usr/local/cuda-8.0/targets/aarch64-linux/lib/libnppi.so.8.0.34
unit_test 373 ubuntu mem REG 179,1 416872 429008 /usr/local/cuda-8.0/targets/aarch64-linux/lib/libnppc.so.8.0.34
unit_test 373 ubuntu mem REG 179,1 27488 809341 /lib/aarch64-linux-gnu/librt-2.23.so
unit_test 373 ubuntu mem REG 179,1 10400 809321 /lib/aarch64-linux-gnu/libdl-2.23.so
unit_test 373 ubuntu mem REG 179,1 1265992 809325 /lib/aarch64-linux-gnu/libc-2.23.so
unit_test 373 ubuntu mem REG 179,1 70664 787006 /lib/aarch64-linux-gnu/libgcc_s.so.1
unit_test 373 ubuntu mem REG 179,1 643136 809311 /lib/aarch64-linux-gnu/libm-2.23.so
unit_test 373 ubuntu mem REG 179,1 1554312 131577 /usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.21
unit_test 373 ubuntu mem REG 179,1 4155280 276734 /home/ubuntu/installed/lib/libopencv_core.so.3.1.0
unit_test 373 ubuntu mem REG 179,1 2785816 276860 /home/ubuntu/installed/lib/libopencv_imgproc.so.3.1.0
unit_test 373 ubuntu mem REG 179,1 280720 276897 /home/ubuntu/installed/lib/libopencv_imgcodecs.so.3.1.0
unit_test 373 ubuntu mem REG 179,1 26435784 276814 /home/ubuntu/installed/lib/libopencv_cudaarithm.so.3.1.0
unit_test 373 ubuntu mem REG 179,1 8697288 276893 /home/ubuntu/installed/lib/libopencv_cudawarping.so.3.1.0
unit_test 373 ubuntu mem REG 179,1 139560 809343 /lib/aarch64-linux-gnu/libpthread-2.23.so
unit_test 373 ubuntu mem REG 179,1 342848 428804 /usr/local/cuda-8.0/targets/aarch64-linux/lib/libcudart.so.8.0.34
unit_test 373 ubuntu mem REG 179,1 125776 809323 /lib/aarch64-linux-gnu/ld-2.23.so
unit_test 373 ubuntu 0u CHR 136,4 0t0 7 /dev/pts/4
unit_test 373 ubuntu 1u CHR 136,4 0t0 7 /dev/pts/4
unit_test 373 ubuntu 2u CHR 136,4 0t0 7 /dev/pts/4
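
The same check can be scripted. Here is a small Python sketch, assuming the lsof column layout shown above; loaded_libraries is a hypothetical helper name, and the sample is just three lines copied from the listing.

```python
def loaded_libraries(lsof_output, needle):
    """Return paths of memory-mapped files whose path contains needle."""
    paths = []
    for line in lsof_output.splitlines():
        fields = line.split()
        # Columns: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
        # FD == "mem" marks a memory-mapped file such as a shared library.
        if len(fields) >= 9 and fields[3] == "mem" and needle in fields[-1]:
            paths.append(fields[-1])
    return paths


SAMPLE = """\
unit_test 373 ubuntu mem REG 179,1 15122216 269212 /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1
unit_test 373 ubuntu mem REG 179,1 4155280 276734 /home/ubuntu/installed/lib/libopencv_core.so.3.1.0
unit_test 373 ubuntu mem REG 179,1 26435784 276814 /home/ubuntu/installed/lib/libopencv_cudaarithm.so.3.1.0
"""

for path in loaded_libraries(SAMPLE, "libopencv"):
    print(path)
```

In the listing above, every libopencv entry is under /home/ubuntu/installed/lib, and nothing from OpenCV4Tegra appears.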

So at this point, I’m simply looking for help on how to isolate the cause of the hang (which is most likely on my end).

I’m attaching the nvidia-bug-report-tegra.log file for the build which uses OpenCV 3.1.

Does the driver output provide any clues?

[26295.926141] gk20a gpu.0: __locked_fifo_preempt: preempt TSG 0 timeout
               
[26295.934219] gk20a gpu.0: gk20a_set_error_notifier: error notifier set to 8 for ch 502
[26295.943019] gk20a gpu.0: gk20a_set_error_notifier: error notifier set to 8 for ch 501
[26295.951217] gk20a gpu.0: gk20a_set_error_notifier: error notifier set to 8 for ch 500
[26295.959448] gk20a gpu.0: gk20a_set_error_notifier: error notifier set to 8 for ch 499
[26295.967597] ---- mlocks ----

[26295.967654] ---- syncpts ----
[26295.967685] id 1 (disp1_a) min 148527 max 148527 refs 1 (previous client : )
[26295.967710] id 2 (disp1_b) min 1 max 1 refs 1 (previous client : )
[26295.967732] id 3 (disp1_c) min 1 max 1 refs 1 (previous client : )
[26295.967756] id 5 (gpu.0_506) min 7586044 max 7586046 refs 1 (previous client : )
[26295.967779] id 6 (gpu.0_505) min 184 max 184 refs 1 (previous client : )
[26295.967804] id 7 (gpu.0_504) min 785592 max 785592 refs 1 (previous client : gpu.0_504)
[26295.967829] id 8 (gpu.0_502) min 2844042 max 2844044 refs 1 (previous client : gpu.0_498)
[26295.967853] id 9 (gpu.0_501) min 691934 max 691934 refs 1 (previous client : gpu.0_499)
[26295.967878] id 11 (gpu.0_500) min 6 max 6 refs 1 (previous client : gpu.0_500)
[26295.967901] id 12 (gpu.0_499) min 6 max 6 refs 1 (previous client : gpu.0_501)
[26295.967923] id 13 (gpu.0_498) min 6 max 6 refs 1 (previous client : gpu.0_502)
[26295.967945] id 14 () min 6 max 6 refs 0 (previous client : gpu.0_497)
[26295.967984] id 27 (vblank1) min 1305489 max 0 refs 1 (previous client : )

[26295.968217] ---- channels ----
[26295.968245] 
               ---- host general irq ----

[26295.968273] sync_hintmask_ext = 0xc0000000
[26295.968291] sync_hintmask = 0x80000000
[26295.968308] sync_intc0mask = 0x00000001
[26295.968325] sync_intmask = 0x00000011
[26295.968340] 
               ---- host syncpt irq mask ----

[26295.968369] syncpt_thresh_int_mask(0) = 0x00010401
[26295.968387] syncpt_thresh_int_mask(1) = 0x00000000
[26295.968404] syncpt_thresh_int_mask(2) = 0x00000000
[26295.968421] syncpt_thresh_int_mask(3) = 0x00000000
[26295.968438] syncpt_thresh_int_mask(4) = 0x00000000
[26295.968455] syncpt_thresh_int_mask(5) = 0x00000000
[26295.968471] syncpt_thresh_int_mask(6) = 0x00000000
[26295.968488] syncpt_thresh_int_mask(7) = 0x00000000
[26295.968505] syncpt_thresh_int_mask(8) = 0x00000000
[26295.968522] syncpt_thresh_int_mask(9) = 0x00000000
[26295.968538] syncpt_thresh_int_mask(10) = 0x00000000
[26295.968555] syncpt_thresh_int_mask(11) = 0x00000000
[26295.968569] 
               ---- host syncpt irq status ----

[26295.968595] syncpt_thresh_cpu0_int_status(0) = 0x00000000
[26295.968613] syncpt_thresh_cpu0_int_status(1) = 0x00000000
[26295.968631] syncpt_thresh_cpu0_int_status(2) = 0x00000000
[26295.968648] syncpt_thresh_cpu0_int_status(3) = 0x00000000
[26295.968665] syncpt_thresh_cpu0_int_status(4) = 0x00000000
[26295.968682] syncpt_thresh_cpu0_int_status(5) = 0x00000000
[26295.968696] 
               ---- host syncpt thresh ----

[26295.968724] syncpt_int_thresh_thresh_0(0) = 1
[26295.968748] syncpt_int_thresh_thresh_0(5) = 7586046
[26295.968769] syncpt_int_thresh_thresh_0(8) = 2844044
[26295.969000] gpu.0 pbdma 0: 
[26295.969024] id: 506 (channel), next_id: 506 (channel) status: valid
[26295.969063] PUT: 0000001c0000ffdc GET: 0000001c0000fbe4 FETCH: 000000fe HEADER: 60030100

[26295.969096] gpu.0 eng 0: 
[26295.969117] id: 0 (tsg), next_id: 506 (channel), ctx: ctxsw_switch 
[26295.969134] busy 

[26295.969160] gpu.0 eng 1: 
[26295.969180] id: 0 (tsg), next_id: 0 (tsg), ctx: invalid 


[26295.969863] 498-gpu.0, pid 373, refs: 2: 
[26295.969883]  in use idle not busy
[26295.969922] TOP: 8000002000580018 PUT: 0000002000580018 GET: 0000002000580018 FETCH: 0000002000580018
               HEADER: 60400000 COUNT: 80000000
               SYNCPOINT 00000000 00000d01 SEMAPHORE 00000000 00000000 00000000 00000000

[26295.969973] 499-gpu.0, pid 373, refs: 2: 
[26295.969988]  in use idle not busy
[26295.970024] TOP: 80000020003c0018 PUT: 00000020003c0018 GET: 00000020003c0018 FETCH: 00000020003c0018
               HEADER: 60400000 COUNT: 80000000
               SYNCPOINT 00000000 00000c01 SEMAPHORE 00000000 00000000 00000000 00000000

[26295.970072] 500-gpu.0, pid 373, refs: 2: 
[26295.970086]  in use idle not busy
[26295.970120] TOP: 80000020001e0018 PUT: 00000020001e0018 GET: 00000020001e0018 FETCH: 00000020001e0018
               HEADER: 60400000 COUNT: 80000000
               SYNCPOINT 00000000 00000b01 SEMAPHORE 00000000 00000000 00000000 00000000

[26295.970165] 501-gpu.0, pid 373, refs: 2: 
[26295.970178]  in use idle not busy
[26295.970211] TOP: 800000200019b318 PUT: 000000200019b318 GET: 000000200019b318 FETCH: 000000200019b318
               HEADER: 60400000 COUNT: 80000000
               SYNCPOINT 00000000 00000901 SEMAPHORE 00000000 00000000 00000000 00000000

[26295.970257] 502-gpu.0, pid 373, refs: 4: 
[26295.970273] not in use on_eng_pending busy
[26295.970309] TOP: 80000001000e0488 PUT: 00000001000e06e4 GET: 00000001000e0488 FETCH: 0000002000027320
               HEADER: 600101b4 COUNT: 01110022
               SYNCPOINT 00000000 00000801 SEMAPHORE 00000001 0000fbd0 00052762 00000004

[26295.970355] 503-gpu.0, pid 2423, refs: 2: 
[26295.970369]  in use idle not busy
[26295.970403] TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
               HEADER: 60400000 COUNT: 00000000
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[26295.970450] 504-gpu.0, pid 2423, refs: 2: 
[26295.970464]  in use idle not busy
[26295.970498] TOP: 800000200002c3d0 PUT: 000000200002c3d0 GET: 000000200002c3d0 FETCH: 000000200002c3d0
               HEADER: 60400000 COUNT: 80000000
               SYNCPOINT 00000000 00000701 SEMAPHORE 0000001c 030c05a0 000f7f79 00001004

[26295.970542] 505-gpu.0, pid 736, refs: 2: 
[26295.970556]  in use idle not busy
[26295.970589] TOP: 8000002000180b80 PUT: 0000002000180b80 GET: 0000002000180b80 FETCH: 0000002000180b80
               HEADER: 60400000 COUNT: 80000000
               SYNCPOINT 00000000 00000601 SEMAPHORE 00000000 00000000 00000000 00000000

[26295.970635] 506-gpu.0, pid 736, refs: 4: 
[26295.970650]  in use on_pbdma_and_eng busy
[26295.970686] TOP: 8000002000020fc0 PUT: 0000002000020fc0 GET: 0000002000020fc0 FETCH: 0000002000020fc0
               HEADER: 60400000 COUNT: 80000000
               SYNCPOINT 00000000 00000501 SEMAPHORE 0000001c 00480000 00000000 01100002

[26295.970735] 507-gpu.0, pid 736, refs: 2: 
[26295.970749]  in use idle not busy
[26295.970782] TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
               HEADER: 60400000 COUNT: 00000000
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[26295.970826] 508-gpu.0, pid 736, refs: 2: 
[26295.970839]  in use idle not busy
[26295.970871] TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
               HEADER: 60400000 COUNT: 00000000
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[26295.970916] 509-gpu.0, pid 736, refs: 2: 
[26295.970929]  in use idle not busy
[26295.970960] TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
               HEADER: 60400000 COUNT: 00000000
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[26295.971003] 510-gpu.0, pid 736, refs: 2: 
[26295.971016]  in use idle not busy
[26295.971048] TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
               HEADER: 60400000 COUNT: 00000000
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[26295.971095] 511-gpu.0, pid 736, refs: 2: 
[26295.971108]  in use idle not busy
[26295.971139] TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
               HEADER: 60400000 COUNT: 00000000
               SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[26295.971554] gk20a gpu.0: gk20a_fifo_handle_mmu_fault: mmu fault on engine 0, engine subid 0 (gpc), client 0 (l1 0), addr 0x0000000b:0x3a18c000, type 13 (region viol), info 0x0000208d,inst_ptr 0x6092d3000
               
[26295.991380] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_os_r : 0
[26295.998854] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_cpuctl_r : 0x40
[26296.006095] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_idlestate_r : 0x1
[26296.013484] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_mailbox0_r : 0x0
[26296.020867] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_mailbox1_r : 0x0
[26296.028192] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_irqstat_r : 0x0
[26296.035404] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_irqmode_r : 0x4
[26296.042710] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_irqmask_r : 0x8704
[26296.050235] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_irqdest_r : 0x0
[26296.057468] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_debug1_r : 0x40
[26296.064677] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_debuginfo_r : 0x0
[26296.072159] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_ctxsw_mailbox_r(0) : 0x0
[26296.080197] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_ctxsw_mailbox_r(1) : 0x1
[26296.088198] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_ctxsw_mailbox_r(2) : 0x50009
[26296.096548] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_ctxsw_mailbox_r(3) : 0x20
[26296.104624] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_ctxsw_mailbox_r(4) : 0x2000a0
[26296.113091] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_ctxsw_mailbox_r(5) : 0x0
[26296.121088] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_ctxsw_mailbox_r(6) : 0x0
[26296.129087] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_ctxsw_mailbox_r(7) : 0x0
[26296.137087] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_engctl_r : 0x0
[26296.144296] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_curctx_r : 0x0
[26296.151428] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_nxtctx_r : 0x0
[26296.158637] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: FECS_FALCON_REG_IMB : 0xbadfbadf
[26296.166725] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: FECS_FALCON_REG_DMB : 0xbadfbadf
[26296.174749] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: FECS_FALCON_REG_CSW : 0xbadfbadf
[26296.182761] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: FECS_FALCON_REG_CTX : 0xbadfbadf
[26296.190759] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: FECS_FALCON_REG_EXCI : 0xbadfbadf
[26296.198847] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: FECS_FALCON_REG_PC : 0xbadfbadf
[26296.206797] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: FECS_FALCON_REG_SP : 0xbadfbadf
[26296.214699] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: FECS_FALCON_REG_PC : 0xbadfbadf
[26296.222614] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: FECS_FALCON_REG_SP : 0xbadfbadf
[26296.230524] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: FECS_FALCON_REG_PC : 0xbadfbadf
[26296.238436] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: FECS_FALCON_REG_SP : 0xbadfbadf
[26296.246442] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: FECS_FALCON_REG_PC : 0xbadfbadf
[26296.254346] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: FECS_FALCON_REG_SP : 0xbadfbadf
[26296.262326] gk20a gpu.0: gk20a_fifo_handle_mmu_fault: gr_status_r : 0x1000081
[26296.271414] gk20a gpu.0: gk20a_fifo_set_ctx_mmu_error_tsg: TSG 0 generated a mmu fault
[26296.279433] gk20a gpu.0: gk20a_fifo_handle_sched_error: fifo sched error : 0x0000000a, failed to find engine
               
[26296.426455] tegradc tegradc.1: blank - powerdown
[26296.426623] tegradc tegradc.1: unblank
[26296.472824] tegradc tegradc.1: nominal-pclk:154128000 parent:154127813 div:1.0 pclk:154127813 152586720~167999520
[26296.841206] tegradc tegradc.1: unblank
[26296.880636] tegradc tegradc.1: blank - powerdown
[26296.908732] tegradc tegradc.1: unblank
[26296.958584] tegradc tegradc.1: nominal-pclk:154128000 parent:154127813 div:1.0 pclk:154127813 152586720~167999520
[26297.328695] tegradc tegradc.1: unblank
[26297.367122] tegradc tegradc.1: blank - powerdown
[26297.396274] tegradc tegradc.1: unblank
[26297.443145] tegradc tegradc.1: nominal-pclk:154128000 parent:154127813 div:1.0 pclk:154127813 152586720~167999520
[26297.811419] tegradc tegradc.1: unblank
[26297.851987] tegradc tegradc.1: blank - powerdown
[26297.880357] tegradc tegradc.1: unblank
[26297.926944] tegradc tegradc.1: nominal-pclk:154128000 parent:154127813 div:1.0 pclk:154127813 152586720~167999520
[26298.295241] tegradc tegradc.1: unblank
[26299.486498] tegradc tegradc.1: blank - powerdown
[26299.486574] tegradc tegradc.1: hdmi: unplugged
[26299.486652] tegradc tegradc.1: unblank
[26299.486677] tegradc tegradc.1: unblank
[26299.671874] tegradc tegradc.1: vrr_setup failed
[26299.671989] tegradc tegradc.1: hdmi: plugged
[26299.686653] tegradc tegradc.1: blank - powerdown
[26299.686858] tegradc tegradc.1: unblank
[26299.733093] tegradc tegradc.1: nominal-pclk:154128000 parent:154127813 div:1.0 pclk:154127813 152586720~167999520
[26300.100832] tegradc tegradc.1: unblank
[26300.149158] tegradc tegradc.1: blank - powerdown
[26300.168753] tegradc tegradc.1: unblank
[26300.215062] tegradc tegradc.1: nominal-pclk:154128000 parent:154127813 div:1.0 pclk:154127813 152586720~167999520
[26300.567178] tegradc tegradc.1: unblank

nvidia-bug-report-tegra.log (262 KB)
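
To make the gk20a messages easier to read, the register dump can be filtered out and only the event lines kept. Here is a minimal Python sketch, assuming the dmesg line format shown above; gk20a_events and its default skip list are hypothetical names of mine.

```python
import re

# "[26295.926141] gk20a gpu.0: <function>: <message>"
GK20A_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+gk20a gpu\.0: (\w+): (.*)$")


def gk20a_events(dmesg_text, skip=("gk20a_fecs_dump_falcon_stats",)):
    """Return (timestamp, function, message) for gk20a lines not in skip."""
    events = []
    for line in dmesg_text.splitlines():
        match = GK20A_RE.match(line.strip())
        if match and match.group(2) not in skip:
            events.append((float(match.group(1)), match.group(2), match.group(3)))
    return events


SAMPLE = """\
[26295.926141] gk20a gpu.0: __locked_fifo_preempt: preempt TSG 0 timeout
[26295.934219] gk20a gpu.0: gk20a_set_error_notifier: error notifier set to 8 for ch 502
[26295.991380] gk20a gpu.0: gk20a_fecs_dump_falcon_stats: gr_fecs_os_r : 0
[26296.279433] gk20a gpu.0: gk20a_fifo_handle_sched_error: fifo sched error : 0x0000000a, failed to find engine
[26296.426455] tegradc tegradc.1: blank - powerdown
"""

for stamp, function, message in gk20a_events(SAMPLE):
    print(f"{stamp:.6f}  {function}: {message}")
```

On the log above this leaves, in order: the TSG 0 preempt timeout, the error notifier being set on channels 502 through 499, the mmu fault (type 13, region viol) with its gr_status_r line, the “TSG 0 generated a mmu fault” line, and finally the fifo sched error.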

I’m not sure this solves your problem, but note that the first fault in the log occurs with:

[   84.727910] compiz[1433]: unhandled level 2 translation fault (11) at 0x00000000, esr 0x92000006
[   84.727920] pgd = ffffffc0cd674000
[   84.731385] [00000000] *pgd=000000014d670003, *pmd=0000000000000000

[   84.738048] CPU: 1 PID: 1433 Comm: compiz Not tainted 3.10.96-tegra #1
[   84.738056] task: ffffffc0cdac41c0 ti: ffffffc0cdf68000 task.ti: ffffffc0cdf68000
[   84.738065] PC is at 0x7f7046b878
[   84.738070] LR is at 0x7f7046b870
[   84.738076] pc : [<0000007f7046b878>] lr : [<0000007f7046b870>] pstate: 40000000
[   84.738080] sp : 0000007fdc9ae760
[   84.738084] x29: 0000007fdc9ae760 x28: 0000000000ce8118 
[   84.738092] x27: 0000007fdc9ae8a8 x26: 0000000000d1b978 
[   84.738099] x25: 0000007fdc9ae7b8 x24: 0000000000ce8118 
[   84.738106] x23: 0000000000c9c310 x22: 0000007fdc9ae7b0 
[   84.738113] x21: 0000000000f1f6f0 x20: 0000007f70765000 
[   84.738120] x19: 0000007fdc9ae7c8 x18: 000000000000003c 
[   84.738126] x17: 0000007f7b994808 x16: 0000007f7bc14608 
[   84.738133] x15: 0000000000000070 x14: 0000007f6694d760 
[   84.738140] x13: 0000007f6694d990 x12: 0000007f6694d000 
[   84.738146] x11: 0000000000000001 x10: 0000007fdc9adec0 
[   84.738153] x9 : 000000000104d410 x8 : 0000000000fa1ad0 
[   84.738160] x7 : 0000000000000000 x6 : 0000000000000000 
[   84.738166] x5 : 0000007f7ba639b0 x4 : 00000000ffffffff 
[   84.738173] x3 : 0000000000fa1ad0 x2 : f3f4bcc5fe883800 
[   84.738179] x1 : 0000000000000000 x0 : 0000000000000000 

[   84.738198] Library at 0x7f7046b878: 0x7f7014d000 /usr/lib/aarch64-linux-gnu/compiz/libunityshell.so
[   84.747415] Library at 0x7f7046b870: 0x7f7014d000 /usr/lib/aarch64-linux-gnu/compiz/libunityshell.so
[   84.756705] vdso base = 0x7f7bd40000
[ 2182.857212] tegradc tegradc.1: hdmi: unplugged
[ 2182.871060] tegradc tegradc.1: blank - powerdown
[ 2182.885991] tegradc tegradc.1: unblank
[ 2182.886062] tegradc tegradc.1: unblank
[ 2183.051836] tegradc tegradc.1: vrr_setup failed
[ 2183.051941] tegradc tegradc.1: hdmi: plugged
[ 2183.058528] tegradc tegradc.1: blank - powerdown
[ 2183.075989] tegradc tegradc.1: unblank
[ 2183.160999] tegradc tegradc.1: nominal-pclk:154128000 parent:154127813 div:1.0 pclk:154127813 152586720~167999520
[ 2183.878374] tegradc tegradc.1: unblank
[ 2183.997743] tegradc tegradc.1: blank - powerdown
[ 2184.026508] tegradc tegradc.1: unblank
[ 2184.102125] tegradc tegradc.1: nominal-pclk:154128000 parent:154127813 div:1.0 pclk:154127813 152586720~167999520
[ 2184.728240] tegradc tegradc.1: unblank

It looks like compiz triggered a memory fault and, long after that, the X server or the display driver seems to restart.
This is just a clue; I have no precise advice. I think I had to disable compiz, or parts of it, on a TK1 with L4T R19 running OpenCV, but I did not investigate enough to be sure.

I’m not sure what the root cause really was, but we were able to refactor our CUDA kernel code so that it no longer calls __syncthreads(), and the GPU no longer hangs.

Thanks for all the suggestions.