Deepstream-test5-app

Hi,

I ran deepstream-test5-app on three AGX Xavier boxes. Two boxes are working fine, but one box is crashing with a segmentation fault. Below is the stack trace.
I don't know the exact root cause. Can somebody suggest what might be causing this issue?

Thread 36 “deepstream-test” received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ed6ffd080 (LWP 32698)]
0x0000007f9ed22b7c in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvrm_gpu.so
(gdb) bt
#0 0x0000007f9ed22b7c in () at /usr/lib/aarch64-linux-gnu/tegra/libnvrm_gpu.so
#1 0x0000007f9f1266dc in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#2 0x0000007f9f0a44dc in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#3 0x0000007f9efa5b0c in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#4 0x0000007f9efa5b7c in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#5 0x0000007f9f01297c in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#6 0x0000007f9f169c90 in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#7 0x0000007f9ef8732c in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#8 0x0000007f9ef874f4 in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#9 0x0000007f9f09bdb4 in cuLaunchKernel () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#10 0x0000007f6b525cd4 in () at /opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_infer.so
#11 0x0000005555c2e2a0 in ()
#12 0x0000007f00000001 in () at /usr/lib/aarch64-linux-gnu/libcudnn_cnn_infer.so.8

Hi,
How about changing eglglessink to fakesink?

I am not using eglglessink; I am using a FILE sink.

Can you enable the debug option in the nvinfer plugin, rebuild it, rerun, and capture the trace? With the debug version you can get more information.

I rebuilt sources/lib/nvinfer with make clean / make / make install using the -g option. After running test5-app, I got the segmentation fault below:

Thread 36 “deepstream-test” received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7eec8de080 (LWP 3451)]
0x0000007f9ed22b7c in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvrm_gpu.so
(gdb) bt
#0 0x0000007f9ed22b7c in () at /usr/lib/aarch64-linux-gnu/tegra/libnvrm_gpu.so
#1 0x0000007f9f1266dc in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#2 0x0000007f9f0a44dc in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#3 0x0000007f9efa5b0c in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#4 0x0000007f9efa5b7c in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#5 0x0000007f9f01297c in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#6 0x0000007f9f169c90 in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#7 0x0000007f9ef8732c in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#8 0x0000007f9ef874f4 in () at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#9 0x0000007f9f09bdb4 in cuLaunchKernel ()
at /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
#10 0x0000007f929752ac in ()
at /usr/local/cuda-10.2/targets/aarch64-linux/lib/libcudart.so.10.2
#11 0x00000000000000e0 in ()

In my environment I use either a FILE sink or an HLS sink. If I switch to the HLS sink, I get a segmentation fault continuously. If I use the FILE sink, the first 10 runs work, but from then on I keep getting a segmentation fault.
Is there any specific dependency on the sink?

Do the three devices have the same environment and the same hardware model, and do they run the same test cases?

Yes, all three have the same environment. I create a directory at run time with a system call such as system("mkdir -p /home/muvva/test"). Can DeepStream create a new directory at run time?

Same hardware model?

Yes, all three have the same hardware model. If I remove the system("mkdir -p /home/muvva/test") call, it works fine. I don't know the reason behind this.
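
For reference, here is a minimal sketch (not the app's actual code) of one way to create the directory without calling system(). It assumes the only purpose of the system("mkdir -p ...") call is to make sure the FILE-sink output directory exists; g_mkdir_with_parents() is GLib's equivalent of "mkdir -p", GLib is already linked by the DeepStream sample apps, and the path is the one from this thread.

#include <glib.h>
#include <glib/gstdio.h>

/* Build standalone with:
 *   gcc mkdir_demo.c $(pkg-config --cflags --libs glib-2.0) -o mkdir_demo */
int main(void)
{
    const gchar *out_dir = "/home/muvva/test";   /* output path from this thread */

    /* g_mkdir_with_parents() behaves like "mkdir -p": it creates any missing
     * parent directories and returns 0 if the directory exists afterwards,
     * -1 on error. Unlike system(), no shell is forked from the running
     * process. */
    if (g_mkdir_with_parents(out_dir, 0755) != 0) {
        g_printerr("Failed to create %s\n", out_dir);
        return 1;
    }

    return 0;
}
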

Do you mean that the device with the issue has system("mkdir -p /home/muvva/test") in your sample, while the other two devices that run well do not?

No, all three devices have the same code.

What is the failure rate? Is it reproduced every time, or only some of the time?

It fails every time (100%).

But if the three devices have the same environment, it does not make sense that one has the issue while the other two do not, given the failure rate is 100%.

I suspect a platform issue on the failing device. You can ask your platform engineer for help.