We have an application that captures from 6 cameras using libargus and then runs inference on the images using libnvinfer. At a high level, camera capture happens on a series of threads (one per camera) and inference runs on its own thread. The same application also runs without issues on a different variant of our system that uses a Jetson Nano and 4 cameras. On the AGX system we have been plagued by segfaults. Some happen randomly after the application has run fine for several hours; others happen very deterministically on startup when we start to use the cameras (however, if it gets past this code, it can run fine for several hours). We are using R32.4.4 of L4T with some driver patches for the cameras and dts files, but other than that it is pretty stock code.
With a bit of digging we can see that in both failure cases (on startup, or randomly after several hours) we hit a segfault in the libnvrm_gpu.so library, and always on this instruction:
1ab7c: b9000001 str w1, [x0] <- at the time of the crash x0 holds 0x7fb7fbf09
If you trace the faulting memory address and look at the memory map, you can see that it is in the mmap’d region for nvhost-ctrl-gpu. When you catch the crash in GDB, that mapping shows no access permissions:
# pmap -x 12364 | grep 7fb7fbf
0000007fb7fbf000 4 0 0 ----- nvhost-ctrl-gpu
0000007fb7fbf000 0 0 0 ----- nvhost-ctrl-gpu
If you stop the process before the crash (or really at any time) and look at the same mapping you can see that it is mapped read/write:
0000007fb7fbf000 4 0 0 rw-s- nvhost-ctrl-gpu
0000007fb7fbf000 0 0 0 rw-s- nvhost-ctrl-gpu
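For anyone who wants to repeat this check without eyeballing pmap output, here is a minimal sketch that looks up the mapping containing a given address in the text of /proc/<pid>/maps. The helper names `find_mapping` and `read_maps` are hypothetical, and the sketch assumes the standard Linux maps line layout (`start-end perms offset dev inode path`).

```python
import re

# One line of /proc/<pid>/maps: "start-end perms offset dev inode [path]"
MAPS_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s+\S+\s+\S+\s+\S+\s*(.*)$"
)


def find_mapping(maps_text, addr):
    """Return (start, end, perms, path) for the mapping containing addr, or None."""
    for line in maps_text.splitlines():
        m = MAPS_LINE.match(line.strip())
        if not m:
            continue
        start, end = int(m.group(1), 16), int(m.group(2), 16)
        if start <= addr < end:
            return start, end, m.group(3), m.group(4)
    return None


def read_maps(pid):
    """Read the raw maps text for a live process (hypothetical convenience helper)."""
    with open(f"/proc/{pid}/maps") as f:
        return f.read()
```

Calling `find_mapping(read_maps(pid), fault_addr)` before and after the fault should show the permission field going from `rw-s` to `---` plus a sharing flag if the behaviour matches what pmap reports above.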
We even looked at straces to see if something in the application was changing the protection on that mmap, but we found no evidence of that. Here is the relevant info from an strace:
[pid 1697] openat(AT_FDCWD, "/dev/nvhost-ctrl-gpu", O_RDWR|O_CLOEXEC) = 15
[pid 1697] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, 15, 0) = 0x7f918c5000
# We don't find any `mprotect` or `mremap` calls in here prior to the segfault that would impact this range
[pid 1711] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f918c5090} ---
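The search through the strace can be automated along these lines. This is only a sketch: `calls_touching` is a hypothetical helper, and besides `mprotect` and `mremap` it also flags `munmap` and `mmap` calls that overlap the range, on the assumption that an unmap or a `MAP_FIXED` remap could also change what sits at that address. It only understands complete single-line records, so calls split into `<unfinished ...>` / `resumed` pairs are not matched. The original `mmap` that created the mapping is reported too.

```python
import re

# Matches e.g. "mprotect(0x7f918c5000, 4096, PROT_NONE) = 0"
CALL = re.compile(
    r"\b(mmap|munmap|mprotect|mremap)\((0x[0-9a-f]+|NULL), (\d+)(.*)\)"
    r"\s+= (0x[0-9a-f]+|-?\d+)"
)


def calls_touching(strace_lines, start, length):
    """Return strace lines whose memory call overlaps [start, start + length)."""
    end = start + length
    hits = []
    for line in strace_lines:
        m = CALL.search(line)
        if not m:
            continue
        name, addr, size, ret = m.group(1), m.group(2), int(m.group(3)), m.group(5)
        if name == "mmap":
            # For mmap the resulting address is the return value.
            if not ret.startswith("0x"):
                continue  # call failed
            base = int(ret, 16)
        else:
            if addr == "NULL":
                continue
            base = int(addr, 16)
        # Standard half-open interval overlap test.
        if base < end and start < base + size:
            hits.append(line.rstrip())
    return hits
```

For the trace above this would be called as `calls_touching(open("strace_5.txt"), 0x7f918c5000, 4096)`.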
In addition, we frequently (but not always) get an error like this in dmesg when it faults. It is hard to tell whether this is the cause or just an effect of the crash:
[ 1236.349196] nvgpu: 17000000.gv11b gk20a_fifo_tsg_unbind_channel_verify_status:2200 [ERR] Channel 506 to be removed from TSG 2 has NEXT set!
[ 1236.349501] nvgpu: 17000000.gv11b gk20a_tsg_unbind_channel:164 [ERR] Channel 506 unbind failed, tearing down TSG 2
[ 1236.350135] nvgpu: 17000000.gv11b gk20a_gr_isr:6021 [ERR] pgraph intr: 0x00000010, chid: 506 not bound to tsg
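To compare these messages across crashes (for example, to see whether the same channel is always involved, or how the timestamps line up with the segfault), the nvgpu lines can be pulled apart with a small parser. This is a sketch only: `parse_nvgpu` is a hypothetical helper and the regular expressions are written against the three lines quoted above, not against any documented nvgpu log format.

```python
import re

# "[ time] nvgpu: <device> <function>:<line> [LEVEL] message"
NVGPU = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+nvgpu:\s+(\S+)\s+(\w+):(\d+)\s+\[(\w+)\]\s+(.*)$"
)
# Channel ids appear as "Channel 506" or "chid: 506" in the lines above.
CHANNEL = re.compile(r"(?:Channel|chid:)\s*(\d+)")


def parse_nvgpu(dmesg_text):
    """Return one dict per nvgpu line with time, function, level, channel, message."""
    events = []
    for line in dmesg_text.splitlines():
        m = NVGPU.match(line.strip())
        if not m:
            continue
        ch = CHANNEL.search(m.group(6))
        events.append({
            "time": float(m.group(1)),
            "function": m.group(3),
            "level": m.group(5),
            "channel": int(ch.group(1)) if ch else None,
            "message": m.group(6),
        })
    return events
```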
We’ve been able to get the system to run the longest when we change the code to initialize the cameras, take one picture, and then infer on that picture over and over again. With the cameras running, however, we frequently hit this segfault, even though we’ve tried putting capture and inference on the same thread, adding serialization mutexes, making sure the CUDA allocations use streams, adding -DCUDA_API_PER_THREAD_DEFAULT_STREAM=1 to the build options, and many other attempts to find the source of this.
The puzzling thing is that in some cases (the startup crash) it happens very deterministically, but if it gets through that (which it does most of the time) it behaves much more like a race condition. From what we can tell, something in the kernel or deep in the libraries seems to be changing the protection on the mapping.
I’ve attached nvidia-bug-report-tegra.log (17.1 MB) with information about the system, and some strace and standard-output files that show different incarnations of the bug for reference:
Crash under GDB, case 1 where we crash on startup:
pmap_3.txt (252 Bytes)
dmesg_3.txt (546 Bytes)
stdout_3.txt (7.1 KB)
Crash under GDB, case 2 where we crash after running for a while:
dmesg_4.txt (901 Bytes)
pmap_good_4.txt (252 Bytes)
stdout_4.txt (171.3 KB)
pmap_3.txt (252 Bytes)
Crash with strace, case 2 where we crash after running for a little while:
strace_5.txt (19.8 MB)
dmesg_5.txt (349 Bytes)
stdout_5.txt (7.9 KB)
NOTE: I could not catch one of the startup crashes under strace, and it is not possible to attach strace and GDB at the same time AFAIK.
This really has us stumped, and we’d really love some help trying to figure out what is going on here. Thanks in advance for any help the community can offer.