sample_mnist segmentation faults on DLA on Xavier

KireinaHoro · December 7, 2018, 4:08pm

I’m using TensorRT 5.0.3 on my Xavier Kit, and I’m having trouble running the MNIST sample on DLA:

jsteward@jetson-0423718016929:~/Work/tensorrt/bin$ ./sample_mnist --useDLACore=1
Building and running a GPU inference engine for MNIST
WARNING: Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
WARNING: Default DLA is enabled but layer (Unnamed Layer* 9) [Constant] is not running on DLA, falling back to GPU.
WARNING: (Unnamed Layer* 10) [ElementWise]: DLA cores do not support SUB ElementWise operation.
WARNING: Default DLA is enabled but layer (Unnamed Layer* 10) [ElementWise] is not running on DLA, falling back to GPU.
Segmentation fault (core dumped)

Debugging with cuda-gdb on sample_mnist_debug gives the following stack trace:

jsteward@jetson-0423718016929:~/Work/tensorrt/bin$ /usr/local/cuda/bin/cuda-gdb --args ./sample_mnist_debug --useDLACore=1
NVIDIA (R) CUDA Debugger
10.0 release
Portions Copyright (C) 2007-2018 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-elf-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./sample_mnist_debug...done.
(cuda-gdb) r
Starting program: /home/jsteward/Work/tensorrt/bin/sample_mnist_debug --useDLACore=1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Building and running a GPU inference engine for MNIST
[New Thread 0x7f8d2f51c0 (LWP 5972)]
WARNING: Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
WARNING: Default DLA is enabled but layer (Unnamed Layer* 9) [Constant] is not running on DLA, falling back to GPU.
WARNING: (Unnamed Layer* 10) [ElementWise]: DLA cores do not support SUB ElementWise operation.
WARNING: Default DLA is enabled but layer (Unnamed Layer* 10) [ElementWise] is not running on DLA, falling back to GPU.

Thread 1 "sample_mnist_de" received signal SIGSEGV, Segmentation fault.
0x0000007f94632ae8 in nvdla::tensorrt::destroyTensorRTParser9(nvdla::tensorrt::ITensorRTParser*) () from /usr/lib/aarch64-linux-gnu/tegra/libnvdla_compiler.so
(cuda-gdb) bt
#0  0x0000007f94632ae8 in nvdla::tensorrt::destroyTensorRTParser9(nvdla::tensorrt::ITensorRTParser*) () from /usr/lib/aarch64-linux-gnu/tegra/libnvdla_compiler.so
#1  0x0000007fb09c6640 in nvinfer1::utility::dla::TmpWisdom::compile(int, int) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#2  0x0000007fb09d2840 in nvinfer1::builder::dla::validateGraphNode(std::unique_ptr<nvinfer1::builder::Node, std::default_delete<nvinfer1::builder::Node> > const&) ()
   from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#3  0x0000007fb09422ac in nvinfer1::builder::createForeignNodes(nvinfer1::builder::Graph&, nvinfer1::builder::ForeignNode* (*)(nvinfer1::Backend, std::string const&), nvinfer1::CudaEngineBuildConfig const&) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#4  0x0000007fb098e504 in nvinfer1::builder::applyGenericOptimizations(nvinfer1::builder::Graph&, nvinfer1::CpuMemoryGroup&, nvinfer1::CudaEngineBuildConfig const&) ()
   from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#5  0x0000007fb095642c in nvinfer1::builder::buildEngine(nvinfer1::CudaEngineBuildConfig&, nvinfer1::rt::HardwareContext const&, nvinfer1::Network const&) ()
   from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#6  0x0000007fb09c12ec in nvinfer1::builder::Builder::buildCudaEngine(nvinfer1::INetworkDefinition&) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#7  0x00000055555589ec in SampleMNIST::build (this=0x7ffffff178) at sampleMNIST.cpp:121
#8  0x0000005555559c5c in main (argc=2, argv=0x7ffffff3c8) at sampleMNIST.cpp:332
(cuda-gdb) display/i $pc
1: x/i $pc
=> 0x7f94632ae8 <_ZN5nvdla8tensorrt22destroyTensorRTParser9EPNS0_15ITensorRTParserE+96420>:     ldr     x2, [x0]
(cuda-gdb) p/x $x0
$1 = 0x0
(cuda-gdb)
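
A note on reading frame #0 above: the `+96420` offset in `<_ZN5nvdla8tensorrt22destroyTensorRTParser9EPNS0_15ITensorRTParserE+96420>` is large for a single function. That suggests libnvdla_compiler.so is stripped and the debugger is only reporting the nearest preceding exported symbol, so the fault may not be in destroyTensorRTParser9 itself. The mangled name can be checked with a small helper like the sketch below. It assumes a GCC or Clang toolchain (Itanium C++ ABI), and `demangle` is a hypothetical helper name, not part of TensorRT.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <cstdlib>    // std::free
#include <string>

// Hypothetical helper: demangle an Itanium-ABI symbol name.
// Returns the input unchanged if demangling fails.
std::string demangle(const std::string& mangled) {
    int status = 0;
    // __cxa_demangle allocates the result with malloc; the caller must free it.
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);
    return result;
}
```

Calling `demangle("_ZN5nvdla8tensorrt22destroyTensorRTParser9EPNS0_15ITensorRTParserE")` should give the same name that the `bt` output prints for frame #0.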

This looks like a null pointer dereference (x0 is 0 at the faulting ldr) somewhere deep in the DLA libraries, not in the sample code itself. Please look into this issue, thanks.
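
For anyone reading the register dump: `ldr x2, [x0]` with x0 = 0 is consistent with code loading the first word of an object (for example its vtable pointer) through a null pointer. The sketch below illustrates that failure pattern and the usual guard. It is not NVIDIA's code; `IParser`, `CountingParser`, `destroyParserUnchecked`, and `destroyParser` are hypothetical names made up for illustration.

```cpp
// Hypothetical stand-in for an interface such as ITensorRTParser.
struct IParser {
    virtual void destroy() = 0;
    virtual ~IParser() = default;
};

// Hypothetical implementation that just counts destroy() calls.
struct CountingParser : IParser {
    int destroyed = 0;
    void destroy() override { ++destroyed; }
};

// Unguarded version: if p is null, the virtual call has to read the
// vtable pointer from address 0. This is undefined behavior and
// typically ends in SIGSEGV, as in the trace above.
void destroyParserUnchecked(IParser* p) {
    p->destroy();
}

// Guarded version: refuses to touch a null pointer and reports
// whether anything was destroyed.
bool destroyParser(IParser* p) {
    if (p == nullptr) {
        return false;
    }
    p->destroy();
    return true;
}
```

With the guard, `destroyParser(nullptr)` returns false instead of crashing, and passing a valid `CountingParser` increments its counter once. Whether the real library is missing a check like this is an assumption; only NVIDIA can confirm where the null pointer comes from.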

NVES · December 13, 2018, 5:53pm

Apologies for the delay. We are reviewing this now.

KireinaHoro · December 20, 2018, 9:59am

Any progress on this topic? It’s been a while.

NVES · January 4, 2019, 3:41pm

Hello,

A fix for this has been committed and will be included in the next version of TensorRT, which should be available very soon (unfortunately, I can't discuss the exact schedule here). Please stay tuned for the announcement.

KireinaHoro · January 5, 2019, 12:07am

Hi,

So will this be released separately or as a part of a new JetPack release?

NVES · January 8, 2019, 9:41pm

Very likely as part of a new JetPack release.