Segmentation fault while loading a DeepStream YOLO model on Jetson Nano

Setup information:
• Hardware Platform (Jetson / GPU): Jetson Nano
• DeepStream Version: 5.0
• JetPack Version (valid for Jetson only): 4.4
• TensorRT Version: 7.1.3
• Issue Type (questions, new requirements, bugs): Question

Hello,
I have set up a DeepStream application that uses a YOLO model for inference (based on marcoslucianops' implementation, which can be found here: GitHub - marcoslucianops/DeepStream-Yolo: NVIDIA DeepStream SDK 6.1 / 6.0.1 / 6.0 configuration for YOLO models).
The application runs in a DeepStream Docker container: nvcr.io/nvidia/deepstream-l4t, tag 5.0.1-20.09-samples.
In general, this application runs fine on multiple Jetson Nano devices with the same setup.

However, we have one Jetson Nano that fails to load the DeepStream inference model.
When the application starts and the TensorRT engine file is loaded into the nvinfer module, a segmentation fault occurs as follows:

Thread 1 "application" received signal SIGSEGV, Segmentation fault.
0x0000007fb7fdc1d4 in elf_machine_rela_relative (reloc_addr_arg=0x7f52a34000, reloc=0x7f5ef2c000, l_addr=546847277056) at ../sysdeps/aarch64/dl-machine.h:376
376     ../sysdeps/aarch64/dl-machine.h: No such file or directory.

The gdb backtrace shows the following:

#0  0x0000007fb7fdc1d4 in elf_machine_rela_relative (reloc_addr_arg=0x7f52f25000, reloc=0x7f5f41d000, l_addr=546852458496) at ../sysdeps/aarch64/dl-machine.h:376
#1  0x0000007fb7fdc1d4 in elf_dynamic_do_Rela (skip_ifunc=0, lazy=0, nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>, map=0x5578371de0) at do-rel.h:112
#2  0x0000007fb7fdc1d4 in _dl_relocate_object (scope=<optimized out>, reloc_mode=reloc_mode@entry=0, consider_profiling=<optimized out>, consider_profiling@entry=0) at dl-reloc.c:258
#3  0x0000007fb7fe2a1c in dl_open_worker (a=0x7fffffa398) at dl-open.c:382
#4  0x0000007fb7728694 in __GI__dl_catch_exception (exception=0xfffffffffffffffe, operate=0x7fffffa1bc, args=0x7fffffa380) at dl-error-skeleton.c:196
#5  0x0000007fb7fe2418 in _dl_open (file=0x7fb003ca80 "libcudnn_cnn_infer.so.8", mode=-2147483646, caller_dlopen=0x7fb002cc24 <cudnnCreateConvolutionDescriptor+156>, nsid=-2, argc=1, argv=0x7ffffff1f8, env=<optimized out>) at dl-open.c:605
#6  0x0000007fb75f5014 in dlopen_doit (a=0x7fffffa658) at dlopen.c:66
#7  0x0000007fb7728694 in __GI__dl_catch_exception (exception=0x7fb7ffe7a8 <__stack_chk_guard>, exception@entry=0x7fffffa5f0, operate=0x7fffffa44c, args=0x7fffffa5d0) at dl-error-skeleton.c:196
#8  0x0000007fb7728738 in __GI__dl_catch_error (objname=0x555589f400, errstring=0x555589f408, mallocedp=0x555589f3f8, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:215
#9  0x0000007fb75f6780 in _dlerror_run (operate=operate@entry=0x7fb75f4fb0 <dlopen_doit>, args=0x7fffffa658, args@entry=0x7fffffa668) at dlerror.c:162
#10 0x0000007fb75f50e8 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#11 0x0000007fb002cc24 in cudnnCreateConvolutionDescriptor () at /usr/lib/aarch64-linux-gnu/libcudnn.so.8
#12 0x0000007f92f5e028 in nvinfer1::rt::cuda::CudnnConvolutionRunner::allocateContextResources(nvinfer1::rt::CommonContext const&, nvinfer1::rt::ExecutionParameters&) ()
    at /usr/lib/aarch64-linux-gnu/libnvinfer.so.7
#13 0x0000007f92f1eb14 in nvinfer1::rt::SafeExecutionContext::setDeviceMemoryInternal(void*) () at /usr/lib/aarch64-linux-gnu/libnvinfer.so.7
#14 0x0000007f92f23f78 in nvinfer1::rt::SafeExecutionContext::SafeExecutionContext(nvinfer1::rt::SafeEngine const&, bool) () at /usr/lib/aarch64-linux-gnu/libnvinfer.so.7
#15 0x0000007f92ca9614 in nvinfer1::rt::ExecutionContext::ExecutionContext(nvinfer1::rt::Engine const&, bool) () at /usr/lib/aarch64-linux-gnu/libnvinfer.so.7
#16 0x0000007f92ca98a0 in nvinfer1::rt::Engine::createExecutionContext() () at /usr/lib/aarch64-linux-gnu/libnvinfer.so.7
#17 0x0000007fb041ea94 in  () at /opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_infer.so
#18 0x0000007fb03fd45c in  () at /opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_infer.so
#19 0x0000007fb03fdef0 in  () at /opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_infer.so
#20 0x0000007fb03ff6cc in  () at /opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_infer.so
#21 0x0000007fb04000a0 in createNvDsInferContext(INvDsInferContext**, _NvDsInferContextInitParams&, void*, void (*)(INvDsInferContext*, unsigned int, NvDsInferLogLevel, char const*, void*)) ()
    at /opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_infer.so
#22 0x0000007fb07829c4 in  () at /usr/lib/aarch64-linux-gnu/gstreamer-1.0/deepstream/libnvdsgst_infer.so
#23 0x0000007fb7258224 in  () at /usr/lib/aarch64-linux-gnu/libgstbase-1.0.so.0

The failing Jetson Nano uses the same Docker image that works fine on the other Jetson Nanos set up in the same way (flashed with the same JetPack), so the problem appears to be installation-dependent.
TensorRT version is 7.1.3.
CUDA version is 10.2.
cuDNN version is 8.0.

Looking at the backtrace, the issue appears to occur in cudnnCreateConvolutionDescriptor().
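
To help narrow this down, one thing we could try is calling cuDNN directly, outside of DeepStream and TensorRT. Below is a hypothetical minimal check (our own sketch, not code from the application; it assumes the cuDNN 8 headers are installed in the container and is built with something like gcc cudnn_check.c -o cudnn_check -lcudnn). According to the _dl_open frame of the backtrace, cudnnCreateConvolutionDescriptor() is the call that triggers the dlopen of libcudnn_cnn_infer.so.8:

/* Hypothetical minimal cuDNN check: calling cudnnCreateConvolutionDescriptor()
 * directly should trigger the same lazy dlopen of libcudnn_cnn_infer.so.8
 * seen in the backtrace, without any DeepStream or TensorRT code involved. */
#include <stdio.h>
#include <cudnn.h>

int main(void)
{
    cudnnHandle_t handle;
    cudnnConvolutionDescriptor_t conv;

    cudnnStatus_t status = cudnnCreate(&handle);
    printf("cudnnCreate: %s\n", cudnnGetErrorString(status));
    if (status != CUDNN_STATUS_SUCCESS)
        return 1;

    status = cudnnCreateConvolutionDescriptor(&conv);
    printf("cudnnCreateConvolutionDescriptor: %s\n", cudnnGetErrorString(status));
    if (status == CUDNN_STATUS_SUCCESS)
        cudnnDestroyConvolutionDescriptor(conv);

    cudnnDestroy(handle);
    return 0;
}

If this small program also segfaults on the failing device, the problem would be entirely below DeepStream.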
Do you have a suggestion on how we can resolve this issue?

Thank you for your time.

Hi,

Could you run the model several times to see if there is a consistent failure rate?
If the error occurs every time, would you mind reconverting the TensorRT engine and trying again?

Thanks.

The failure rate is 100%; it can never load the engine.
Reconverting the engine does not solve the issue (we reconverted it on a different, working system and updated the Docker image with the new engine).
If we remove the engine from the Docker container and let the faulty setup rebuild it, the same segmentation fault occurs as when trying to load the engine.
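
One further isolation step we could run on the faulty unit (a hypothetical sketch of our own, not something we have tried yet): the backtrace shows the crash inside the dynamic loader while it relocates libcudnn_cnn_infer.so.8, so loading that library directly with dlopen, with no DeepStream or TensorRT involved, should tell us whether the library file or the loader on this particular device is at fault. Built with something like gcc dl_check.c -o dl_check -ldl:

/* Hypothetical loader isolation test: eagerly load and relocate the library
 * named in the _dl_open frame of the backtrace. RTLD_NOW forces all
 * relocations up front, matching lazy=0 in the elf_dynamic_do_Rela frame. */
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    void *handle = dlopen("libcudnn_cnn_infer.so.8", RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    printf("libcudnn_cnn_infer.so.8 loaded and relocated successfully\n");
    dlclose(handle);
    return 0;
}

A segfault here would point at the shared library (or the storage it sits on) on this unit rather than at DeepStream.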

Hello,

We produced a minimal failure case for debugging the issue.
On a different Jetson Nano (with the same JetPack setup as specified earlier) we installed DeepStream 5.0.
We modified deepstream_test1_app.c to use a fakesink instead of the EGL sink for testing purposes, as follows:
deepstream_test1_app.c (10.3 KB)
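
For reference, the functional change in the attached file amounts to swapping the display sink for a fakesink, roughly like this (a sketch using the stock deepstream-test1 sample's element and variable names, not an exact diff of our attachment):

/* Stock deepstream-test1 on Jetson renders through EGL: */
sink = gst_element_factory_make ("nveglglessink", "nvvideo-renderer");

/* Modified for headless testing inside the container: */
sink = gst_element_factory_make ("fakesink", "fakesink");

/* The PLATFORM_TEGRA-only nvegltransform element is then no longer needed,
 * so the elements are linked as:
 *   streammux -> pgie -> nvvidconv -> nvosd -> sink
 */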
We ran make to build the deepstream-test1-app executable.

We set up a Dockerfile based on the deepstream-l4t Docker image and built it into an image, as follows:
Dockerfile (397 Bytes)

We pushed this docker image to the Jetson Nano that is producing the issue.
To test the application, we run the following commands:

docker run -it --rm --net=host --runtime nvidia <Image ID>
cd /opt/nvidia/deepstream/deepstream-5.0/sources/apps/sample_apps/deepstream-test1
./deepstream-test1-app ../../../../samples/streams/sample_720p.h264

This runs fine on the original Jetson Nano, but produces the same segmentation fault as documented earlier when run on the Jetson Nano that is exhibiting the issue.

The backtrace is as follows:

#0  0x0000007fb7fdc1d4 in elf_machine_rela_relative (reloc_addr_arg=0x7f538e5000, reloc=0x7f5fddd000, l_addr=546862682112)
    at ../sysdeps/aarch64/dl-machine.h:376
#1  0x0000007fb7fdc1d4 in elf_dynamic_do_Rela (skip_ifunc=0, lazy=0, nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>, map=0x557792c660) at do-rel.h:112
#2  0x0000007fb7fdc1d4 in _dl_relocate_object (scope=<optimized out>, reloc_mode=reloc_mode@entry=0, consider_profiling=<optimized out>, consider_profiling@entry=0) at dl-reloc.c:258
#3  0x0000007fb7fe2a1c in dl_open_worker (a=0x7fffff5c88) at dl-open.c:382
#4  0x0000007fb7ca9694 in __GI__dl_catch_exception (exception=0xfffffffffffffffe, operate=0x7fffff5aac, args=0x7fffff5c70)
    at dl-error-skeleton.c:196
#5  0x0000007fb7fe2418 in _dl_open (file=0x7fb02dba80 "libcudnn_cnn_infer.so.8", mode=-2147483646, caller_dlopen=0x7fb02cbc24 <cudnnCreateConvolutionDescriptor+156>, nsid=-2, argc=2, argv=0x7ffffff338, env=<optimized out>) at dl-open.c:605
#6  0x0000007fb7abd014 in dlopen_doit (a=0x7fffff5f48) at dlopen.c:66
#7  0x0000007fb7ca9694 in __GI__dl_catch_exception (exception=0x7fb7ffe7a8 <__stack_chk_guard>,
    exception@entry=0x7fffff5ee0, operate=0x7fffff5d3c, args=0x7fffff5ec0) at dl-error-skeleton.c:196
#8  0x0000007fb7ca9738 in __GI__dl_catch_error (objname=0x55557a86e0, errstring=0x55557a86e8, mallocedp=0x55557a86d8, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:215
#9  0x0000007fb7abe780 in _dlerror_run (operate=operate@entry=0x7fb7abcfb0 <dlopen_doit>, args=0x7fffff5f48,
    args@entry=0x7fffff5f58) at dlerror.c:162
#10 0x0000007fb7abd0e8 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#11 0x0000007fb02cbc24 in cudnnCreateConvolutionDescriptor () at /usr/lib/aarch64-linux-gnu/libcudnn.so.8
#12 0x0000007f931defc8 in nvinfer1::rt::cuda::CudnnConvolutionRunner::createConvolutionDescriptors(nvinfer1::CuResource<cudnnConvolutionStruct*, &cudnnCreateConvolutionDescriptor, &cudnnDestroyConvolutionDescriptor>&, nvinfer1::CuResource<cudnnFilterStruct*, &cudnnCreateFilterDescriptor, &cudnnDestroyFilterDescriptor>&, nvinfer1::utils::TensorLayout const&, nvinfer1::rt::CommonContext const&, nvinfer1::rt::GenericCudnnTactic<cudnnConvolutionFwdAlgo_t, 7>) const () at /usr/lib/aarch64-linux-gnu/libnvinfer.so.7
#13 0x0000007f9310ad3c in nvinfer1::builder::CudnnConvolutionBuilder::getValidTactics(nvinfer1::builder::EngineBuildContext const&) () at /usr/lib/aarch64-linux-gnu/libnvinfer.so.7
#14 0x0000007f9304209c in nvinfer1::cudnn::getMaxPersistentMem(nvinfer1::builder::EngineBuildContext const&, std::unique_ptr<nvinfer1::builder::RunnerBuilder, std::default_delete<nvinfer1::builder::RunnerBuilder> > const&) ()
    at /usr/lib/aarch64-linux-gnu/libnvinfer.so.7
#15 0x0000007f93042244 in nvinfer1::cudnn::getMaxPersistentMem(nvinfer1::builder::EngineBuildContext const&, std::vector<std::uniqu… (backtrace truncated at the gdb pager prompt)

Does this help identify the issue?
It occurs every time we try to run a DeepStream application on this Jetson Nano.

Thanks for the data.

We are checking this issue internally.
We will share more information later.

Also, could you share whether your working and non-working devices are the A02 or the B01 revision?

Thanks.

Both the working and the non-working Jetson Nano are the B01 revision.

Hi,

It seems there is some issue with that non-working Nano.
Please RMA the device to get a new one.

Thanks.
