SIGSEGV, Segmentation fault when building TLT engines

We have a use case where we want to pre-build model engines automatically, rather than when nvinfer is launched, and without parsing .txt files.
For that, we use the nvdsinfer source code, available at /opt/nvidia/deepstream/deepstream-5.0/sources/libs/nvdsinfer. This is a minimal, reproducible example of what we do:

main.cpp (4.7 KB)

In this code, we build a yolov3-tiny model, and then a peoplenet model.
The yolov3-tiny engine builds fine, but the peoplenet build segfaults.
When building only peoplenet, it works. When building peoplenet first and then yolov3-tiny, both builds work too.
So the build order matters.

Here is the gdb output:

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7fa0a83560 (LWP 29112) "test_main" 0x0000007faa7c0434 in ?? () from /usr/lib/aarch64-linux-gnu/libnvparsers.so.7
  2    Thread 0x7f999ef9b0 (LWP 29332) "cuda-EvtHandlr" 0x0000007faa1f0048 in __GI___poll (fds=0x559d1ad590, nfds=4294967295, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
  3    Thread 0x7f927f59b0 (LWP 30824) "test_main" 0x0000007faa09a9c8 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55558f2460)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
(gdb) backtrace
#0  0x0000007faa7c0434 in ?? () from /usr/lib/aarch64-linux-gnu/libnvparsers.so.7
#1  0x0000007faa7c3918 in ?? () from /usr/lib/aarch64-linux-gnu/libnvparsers.so.7
#2  0x0000007fb7e620f8 in ?? () from /opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_inferutils.so
#3  0x0000007fb7e62790 in NvDsInferCudaEngineGetFromTltModel () from /opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_inferutils.so
#4  0x0000005555560768 in nvdsinfer::TrtModelBuilder::getCudaEngineFromCustomLib (this=0x559d1984b0, cudaEngineGetDeprecatedFcn=0x0, 
    cudaEngineGetFcn=0x7fb7e622d0 <NvDsInferCudaEngineGetFromTltModel>, initParams=..., networkMode=@0x7ffffe5900: NvDsInferNetworkMode_FP32)
    at ../tegra_platform/nvds_model_builder/nvdsinfer_model_builder.cpp:791
#5  0x0000005555560b64 in nvdsinfer::TrtModelBuilder::buildModel (this=0x559d1984b0, initParams=..., suggestedPathName="") at ../tegra_platform/nvds_model_builder/nvdsinfer_model_builder.cpp:858
#6  0x000000555555a598 in buildModel (params=...) at ../tegra_platform/nvds_model_builder/main.cpp:109
#7  0x000000555555a7b8 in main () at ../tegra_platform/nvds_model_builder/main.cpp:126
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f999ef9b0 (LWP 29332))]
#0  0x0000007faa1f0048 in __GI___poll (fds=0x559d1ad590, nfds=4294967295, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
41	../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) backtrace
#0  0x0000007faa1f0048 in __GI___poll (fds=0x559d1ad590, nfds=4294967295, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
#1  0x0000007f9fcd20a0 in ?? () from /usr/lib/aarch64-linux-gnu/libcuda.so
#2  0x0000007f9fd52b94 in ?? () from /usr/lib/aarch64-linux-gnu/libcuda.so
#3  0x0000007f9fcd44bc in ?? () from /usr/lib/aarch64-linux-gnu/libcuda.so
#4  0x0000007faa092088 in start_thread (arg=0x7ffffe508f) at pthread_create.c:463
#5  0x0000007faa1f94ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f927f59b0 (LWP 30824))]
#0  0x0000007faa09a9c8 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55558f2460) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
205	../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) backtrace
#0  0x0000007faa09a9c8 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55558f2460) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x55558f2460, abstime=0x0) at sem_waitcommon.c:111
#2  0x0000007faa09aae8 in __new_sem_wait_slow (sem=0x55558f2460, abstime=0x0) at sem_waitcommon.c:181
#3  0x0000007f9fcd26a0 in ?? () from /usr/lib/aarch64-linux-gnu/libcuda.so
#4  0x0000007f9fcbdfb4 in ?? () from /usr/lib/aarch64-linux-gnu/libcuda.so
#5  0x0000007f9fcd44bc in ?? () from /usr/lib/aarch64-linux-gnu/libcuda.so
#6  0x0000007faa092088 in start_thread (arg=0x7ffffe0f3f) at pthread_create.c:463
#7  0x0000007faa1f94ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

• Hardware Platform: Jetson AGX Xavier
• DeepStream Version: 5.0 GA
• JetPack Version: 4.4
• TensorRT Version: 7.1.3

@Blard.Theophile

Could you please provide an entire building directory with source code, makefile, configuration files and README included so that we can quickly compile, build and debug the program?

Of course: model-builder.zip (23.0 KB)

You’ll find the building instructions in the README.
The prerequisites are to download PeoplenetV2 from NVIDIA NGC and put it in the default directory, then download the Yolo weights and build the Yolo library (if that is not already done).

nvdsinfer_func_utils.cpp, nvdsinfer_func_utils.h, nvdsinfer_model_builder.h and nvdsinfer_model_builder.cpp are unmodified copies from /opt/nvidia/deepstream/deepstream-5.0/sources/libs/nvdsinfer/, placed here for simplicity.

Hi, did you manage to reproduce the issue?

@Blard.Theophile

Yes, we reproduced it, but we are still searching for the root cause.

Any update on this?

The problem can be reproduced. It will take some time to debug. In the meantime, it works if you remove the Yolo engine steps and only generate the peoplenet engine.

You can also switch the order of the two models and generate peoplenet first. That works.

Switching the build order does work, but it is not a satisfying solution for us. In our product, the engines are built dynamically, depending on our end users' needs. We built an application that allows them to choose which models they want to use, and on how many cameras. As a result, the build order can never be predetermined.
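
One order-independent stopgap, not confirmed anywhere in this thread, would be to run each engine build in its own child process, so that whatever state the first build leaves behind cannot reach the second. The sketch below assumes the crash comes from in-process state (an unverified guess) and that the parent process does not initialize CUDA or TensorRT before forking, since a CUDA context is not usable across fork(). `runIsolated` is a hypothetical helper name, and the callable passed to it stands in for a wrapper around the `buildModel(params)` call from main.cpp.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <functional>

// Hypothetical helper (not part of nvdsinfer): runs `build` in a forked
// child process and reports whether it succeeded.
// Returns true only if the child exited normally with status 0. A crash
// (for example SIGSEGV) or a non-zero exit status yields false.
// Assumption: the parent has not touched CUDA/TensorRT before this call.
bool runIsolated(const std::function<bool()>& build) {
    pid_t pid = fork();
    if (pid < 0) {
        return false;  // fork failed
    }
    if (pid == 0) {
        // Child: do the build, then exit immediately without running the
        // parent's atexit handlers or static destructors.
        _exit(build() ? 0 : 1);
    }
    // Parent: wait for the child and inspect how it terminated.
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return false;
    }
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

With this in place, each model would be built through something like `runIsolated([&] { return buildModel(yoloParams); })`, where `yoloParams` is a placeholder for the init params already set up in main.cpp. A crash in one build then shows up as a `false` return value instead of taking down the whole builder, at the cost of one process spawn per engine.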

We have an internal bug to track this issue and will get back to you when there is any progress.