Application segfaulting using libargus and libnvinfer on AGX

We have an application that processes 6 cameras using libargus and then runs inference on the images using libnvinfer. At a high level, the camera capture happens on a series of threads (one per camera) and the inference runs on its own thread. The same application also runs on a different variant of our system that uses a Jetson Nano and 4 cameras with no issues. On the AGX system we have been plagued by segfaults, some of which happen randomly after running fine for several hours, and others which happen very deterministically on startup when we start to use the cameras (however, if it gets past this code, it can run fine for several hours). We are using R32.4.4 of L4T, and we have some driver patches for the cameras and dts files, but other than that it is pretty stock code.
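
For context, here is a heavily simplified sketch of our thread layout. CaptureFrame() and RunInference() are hypothetical placeholders standing in for the libargus and libnvinfer code; the real application obviously does much more per frame (and has proper backpressure and shutdown handling), but the threading structure is essentially this:

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct Frame { int camera_id; /* handle to the captured image would live here */ };

std::mutex queue_mutex;
std::condition_variable queue_cv;
std::deque<Frame> frame_queue;
std::atomic<bool> running{true};

// Placeholders for the real libargus capture and libnvinfer inference code.
Frame CaptureFrame(int camera_id) { return Frame{camera_id}; }
void RunInference(const Frame &) {}

void CameraThread(int camera_id) {
    while (running) {
        Frame f = CaptureFrame(camera_id);
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            frame_queue.push_back(f);  // no bounding/backpressure in this sketch
        }
        queue_cv.notify_one();
    }
}

void InferenceThread() {
    while (running) {
        std::unique_lock<std::mutex> lock(queue_mutex);
        queue_cv.wait(lock, [] { return !frame_queue.empty() || !running; });
        if (frame_queue.empty()) continue;
        Frame f = frame_queue.front();
        frame_queue.pop_front();
        lock.unlock();
        RunInference(f);
    }
}

int main() {
    std::vector<std::thread> cameras;
    for (int i = 0; i < 6; ++i) cameras.emplace_back(CameraThread, i);  // one thread per camera
    std::thread infer(InferenceThread);

    std::this_thread::sleep_for(std::chrono::seconds(1));  // stand-in for "run until shutdown"
    running = false;
    queue_cv.notify_all();

    for (auto &t : cameras) t.join();
    infer.join();
}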

With a bit of digging we can see that in both failure cases (on startup, or randomly after several hours) we hit a segfault in the libnvrm_gpu.so library, and always on this instruction in the library:

1ab7c:       b9000001        str     w1, [x0] <- at the time of the crash x0 holds 0x7fb7fbf09

If you trace the faulting memory address and look at the memory map, you can see that it is in the mmap’d region for nvhost-ctrl-gpu, which, when you catch it crashing in GDB, looks like this:

# pmap -x 12364 | grep 7fb7fbf
0000007fb7fbf000       4       0       0 ----- nvhost-ctrl-gpu
0000007fb7fbf000       0       0       0 ----- nvhost-ctrl-gpu

If you stop the process before the crash (or really at any time) and look at the same mapping you can see that it is mapped read/write:

0000007fb7fbf000       4       0       0 rw-s- nvhost-ctrl-gpu
0000007fb7fbf000       0       0       0 rw-s- nvhost-ctrl-gpu

We even looked at straces to see if something in the application was changing the protection on that mmap, but we found no evidence of that. Here is the relevant info from an strace:

[pid  1697] openat(AT_FDCWD, "/dev/nvhost-ctrl-gpu", O_RDWR|O_CLOEXEC) = 15
[pid  1697] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, 15, 0) = 0x7f918c5000
# We don't find any `mprotect` or `mremap` calls in here prior to the segfault that would impact this range
[pid  1711] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f918c5090} ---
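
In case it helps anyone instrument this, here is a minimal sketch of a SIGSEGV handler that dumps /proc/self/maps at fault time, so the permissions of the nvhost-ctrl-gpu mapping can be captured without GDB attached (illustrative code, not from our application):

#include <cstdint>
#include <cstring>
#include <signal.h>
#include <fcntl.h>
#include <unistd.h>

// SIGSEGV handler that prints the faulting address and then dumps
// /proc/self/maps, using only async-signal-safe calls (write/open/read/close).
static void segv_handler(int, siginfo_t *info, void *) {
    const char msg[] = "SIGSEGV at 0x";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);

    // Format si_addr as hex by hand; snprintf is not async-signal-safe.
    char buf[2 + sizeof(uintptr_t) * 2];
    uintptr_t a = reinterpret_cast<uintptr_t>(info->si_addr);
    int pos = sizeof(buf);
    buf[--pos] = '\n';
    do { buf[--pos] = "0123456789abcdef"[a & 0xf]; a >>= 4; } while (a && pos);
    write(STDERR_FILENO, buf + pos, sizeof(buf) - pos);

    // Dump the full memory map so the protection of the faulting region
    // (e.g. the nvhost-ctrl-gpu mapping) is captured at the moment of the fault.
    int fd = open("/proc/self/maps", O_RDONLY);
    if (fd >= 0) {
        char page[4096];
        ssize_t n;
        while ((n = read(fd, page, sizeof(page))) > 0)
            write(STDERR_FILENO, page, static_cast<size_t>(n));
        close(fd);
    }
    _exit(139);  // same exit status the default SIGSEGV disposition would give
}

int main() {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);

    // ... application code (libargus capture, libnvinfer inference, etc.) ...
    *reinterpret_cast<volatile int *>(0) = 1;  // deliberate fault, just to demonstrate the handler
}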

In addition, we frequently (although not always) get an error in dmesg like this when it faults (although it is hard to tell whether this is the cause or just an effect of the crash):

[ 1236.349196] nvgpu: 17000000.gv11b gk20a_fifo_tsg_unbind_channel_verify_status:2200 [ERR]  Channel 506 to be removed from TSG 2 has NEXT set!
[ 1236.349501] nvgpu: 17000000.gv11b          gk20a_tsg_unbind_channel:164  [ERR]  Channel 506 unbind failed, tearing down TSG 2
[ 1236.350135] nvgpu: 17000000.gv11b                      gk20a_gr_isr:6021 [ERR]  pgraph intr: 0x00000010, chid: 506 not bound to tsg

We’ve been able to get the system to run the longest when we change the code to initialize the cameras, take one picture, and then infer on that same image over and over again. However, we frequently hit this segfault when we have the cameras running, even though we’ve tried putting them on the same thread, adding serialization mutexes, making sure the CUDA allocations used streams, adding -DCUDA_API_PER_THREAD_DEFAULT_STREAM=1 to the build options, and many other attempts to find the source of this.
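
For reference, the serialization/explicit-stream attempt looked roughly like this (a sketch only; gGpuMutex, the binding indices, and the buffer arguments are illustrative, and the TensorRT call assumes the enqueueV2 API from the TensorRT release that ships with R32.4.4):

#include <mutex>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

std::mutex gGpuMutex;  // illustrative: a single mutex guarding all GPU submissions

// Runs one inference with an explicit stream, with every GPU call serialized
// behind gGpuMutex. Binding 0 is assumed to be the input and binding 1 the
// output; that matches our network but is an assumption in this sketch.
bool InferOnce(nvinfer1::IExecutionContext *context,
               void **deviceBindings,
               const void *hostInput, size_t inputBytes,
               void *hostOutput, size_t outputBytes,
               cudaStream_t stream) {
    std::lock_guard<std::mutex> lock(gGpuMutex);
    if (cudaMemcpyAsync(deviceBindings[0], hostInput, inputBytes,
                        cudaMemcpyHostToDevice, stream) != cudaSuccess)
        return false;
    if (!context->enqueueV2(deviceBindings, stream, nullptr))
        return false;
    if (cudaMemcpyAsync(hostOutput, deviceBindings[1], outputBytes,
                        cudaMemcpyDeviceToHost, stream) != cudaSuccess)
        return false;
    return cudaStreamSynchronize(stream) == cudaSuccess;
}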

The puzzling thing is that in some cases (the startup crash) it happens very deterministically, but if it can get through that (which it usually does) it behaves much more like a race-condition bug. From what we can tell, it seems something in the kernel or deep in the libraries is changing the protection.

I’ve attached
nvidia-bug-report-tegra.log (17.1 MB)
with information about the system, and I’ve attached some strace and standard out files that show different incarnations of the bug for reference:

Crash under GDB, case 1 where we crash on startup:
pmap_3.txt (252 Bytes)
dmesg_3.txt (546 Bytes)
stdout_3.txt (7.1 KB)

Crash under GDB, case 2 where we crash after running for a while:
dmesg_4.txt (901 Bytes)
pmap_good_4.txt (252 Bytes)
stdout_4.txt (171.3 KB)
pmap_3.txt (252 Bytes)

Crash with strace, case 2 where we crash after running for a little while:
strace_5.txt (19.8 MB)
dmesg_5.txt (349 Bytes)
stdout_5.txt (7.9 KB)

NOTE: I could not catch one of the startup crashes under strace, and it is not possible to attach strace and GDB at the same time AFAIK.

This really has us stumped, and we’d really love some help trying to figure out what is going on here. Thanks in advance for any help the community can offer.

Is this issue reproducible on the devkit? Or are you doing this on your custom board?

Is it possible to move to the latest release and test?

We originally were using a devkit, although that was earlier in development and I’m not sure if we saw this error occur. Lately it has been on a system with a CTI baseboard, although the changes for that are relatively small (basically just some DTS changes) so I doubt that it is related to the baseboard. It would be a fair bit of work to configure one of these systems with a devkit baseboard at this point, although we could if you had a specific test in mind.

We could move to a newer release to test, but again, this is a fair bit of work. Do you have a specific bugfix or change in the newer release in mind that you think may solve this issue? I’m hesitant to put the time into upgrading if it is just a shot in the dark. I’d be willing to do it if there are no other ideas on what could be causing the issue, though.

I’ll also note that yesterday we ran another test where we put all of the code into a single thread; it ran for longer without crashing but did eventually segfault in a similar manner. This seems to suggest (at least to me) that there is some issue in the drivers. It is entirely possible we are violating some invariant in how we use the libraries, but we’ve scoured the documentation a fair bit and can’t find anything obvious.

Actually, I don’t have much debugging experience with libargus and libnvinfer. What I want to check is whether the GPU error in your dmesg is really related to that segfault or not.

If you don’t want to upgrade the software version, I would suggest trying to reproduce this issue on the devkit; if it reproduces there, please share the code with us. Otherwise it is really hard for other engineers to help check.

We could probably share a compiled binary and maybe the engine file, but you would probably need a system with 6 cameras to fully simulate the situation we are seeing. Is that something your engineering team has access to? It is hard to reproduce this issue as a small snippet of code because it sometimes can take hours to happen. We could try to isolate the startup crash though.

We don’t have a 6-camera system. But it would be better if you can narrow down the issue first.

Looking at the long-running segfault (not the startup crash), we think we’ve isolated it to some VPI code - we’re using VPI 0.4.

We’re using VPI to convert from YUV to VPI_IMAGE_FORMAT_BGRA8. The code is similar to what’s shown in this sample: VPI - Vision Programming Interface: Temporal Noise Reduction.

When we remove this code, we don’t see the long-running segfault (only the startup segfault).

Hi @jtsd
Is the issue seen by running the default vpi_sample_09_tnr, or is a patch needed to reproduce it? We would need your help sharing the full steps so that we can set up and try to reproduce the issue first, and then do further investigation.

We’ve never run vpi_sample_09_tnr, and I’m not sure that code would segfault on its own. For this specific segfault, we believe it’s some interaction between VPI and either libnvinfer or libargus, because the error goes away when we remove VPI (in favor of NvBufferTransform). As an aside, we’ve been using VPI to avoid a colorspace conversion bug in NvBufferTransform.

Hi,
Could you help clarify whether it happens with VPI + libargus or VPI + libnvinfer? It looks like the issue is not seen by only running vpi_sample_09_tnr. Is it possible to share a patch on the sample for reproducing the issue?

I believe this is due to an interaction between libnvinfer and VPI. I do not believe it crashes when we run the VPI code with just the libargus code (but I’m not 100% sure). I don’t have a patch on vpi_sample_09_tnr, but if I were to make one, I’d likely run repeated inference calls (like in this TensorRT sample) in a thread.
https://github.com/NVIDIA/TensorRT/blob/master/samples/opensource/sampleMNIST/sampleMNIST.cpp#L67
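
Very roughly, the kind of patch I have in mind would look like this (a sketch only, untested; the engine/context/bindings setup is assumed to come from the sample, and the names here are illustrative):

#include <atomic>
#include <thread>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// Run inference repeatedly on its own thread, the way our application does,
// while the VPI conversion runs elsewhere. The execution context and device
// bindings are assumed to be set up as in the sample; input data is assumed
// to be written into the bindings by the other thread.
void InferenceLoop(nvinfer1::IExecutionContext *context,
                   void **deviceBindings,
                   std::atomic<bool> *stop) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    while (!stop->load()) {
        context->enqueueV2(deviceBindings, stream, nullptr);
        cudaStreamSynchronize(stream);
    }
    cudaStreamDestroy(stream);
}

// Usage (illustrative):
//   std::atomic<bool> stop{false};
//   std::thread t(InferenceLoop, context, bindings, &stop);
//   ... run the VPI conversion path ...
//   stop = true; t.join();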

Hi,
Not sure, but it looks like your application is not based on the gstreamer-based DeepStream SDK. Maybe it is close to this sample:

/usr/src/jetson_multimedia_api/samples/frontend

Please check if you can make a patch based on the sample so that we can replicate the issue and do further investigation.

I landed on this thread this morning because I’m encountering a similar issue (a segfault deep inside libnvrm_gpu.so, precipitated by a call to cudaMallocPitch) on a Xavier AGX. I don’t have a whole lot to contribute to it yet, other than that my segfault is happening on exactly the same assembly instruction:

0x7f86684b7c str w1, [x0]

We’re using CUDA 10.2.89, I think, based on the presence of libcudart.so.10.2.89 in /usr/local/cuda-10.2. I will update if I learn anything more.
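
For what it’s worth, the failing call on our side is just an ordinary pitched allocation. Something like the following (illustrative sizes; the version queries are only there to confirm which runtime/driver we are actually linking against):

#include <cstdio>
#include <cuda_runtime_api.h>

int main() {
    int runtimeVersion = 0, driverVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);  // e.g. 10020 for CUDA 10.2
    cudaDriverGetVersion(&driverVersion);
    std::printf("CUDA runtime %d, driver %d\n", runtimeVersion, driverVersion);

    // The crash for us is precipitated by a call like this one; the sizes
    // here are arbitrary stand-ins for our real image dimensions.
    void *devPtr = nullptr;
    size_t pitch = 0;
    cudaError_t err = cudaMallocPitch(&devPtr, &pitch, 1920 * 4, 1080);
    if (err != cudaSuccess)
        std::printf("cudaMallocPitch failed: %s\n", cudaGetErrorString(err));
    else
        cudaFree(devPtr);
    return 0;
}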

To the comment at the very start: at crash time it is trying to access this memory address:

0x7f86684b7c    str    w1, [x0]
x0             0x7fb7fac090     548547510416

And looking at pmap, it shows the same “no permissions” situation:

xavier@xavier-delta:~$ sudo pmap -x 8657 | grep 7fb7fac
0000007fb7fac000       4       0       0 ----- nvhost-ctrl-gpu
0000007fb7fac000       0       0       0 ----- nvhost-ctrl-gpu

I can’t see any of the symbol names in /usr/lib/aarch64-linux-gnu/libcuda.so, but the last function call that has a symbol is:

#10 0x0000007fa71f7950 in cudart::contextState::loadCubin(bool*, cudart::globalModule*) () at /usr/local/lib/libopencv_core.so.4.5

Hi Tony,
Just for completeness, are you running any of the following in your code: libargus, VPI, or libnvinfer?

libnvinfer1 and OpenCV recompiled with CUDA support.

A colleague had an interesting observation yesterday. While we’re not using libargus or VPI, we are (through a vendor library) using Rivermax, and that is doing zero-copy writes into memory while CUDA and nvinfer1 are also working. We’ve noticed that we’re also getting higher error rates on the Rivermax stream, so maybe there’s some conflict going on there?

@tony-persea this sounds like a data copy issue between nvinfer and the libraries copying in data for it to access. I would imagine it gets worse with an increasing number of inference sessions/input sources. Are you using an AGX devkit or a module with a custom carrier board?

It’s on an AGX devkit. We actually came up with a hypothesis around this late last week. Our camera system (3rd party) uses Mellanox Rivermax to do zero-copy writes directly into RAM from the camera; maybe that’s causing conflicts with CUDA/TensorRT initialization or locking?