Segfault when importing ONNX model

When loading an ONNX model (converted from PyTorch), the program segfaults with the following backtrace:

#0  0x0000007fb0d6d36c in __dynamic_cast () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#1  0x0000007fb10fcb08 in onnx2trt::TypeSerializingPlugin::clone() const () from /usr/lib/aarch64-linux-gnu/libnvonnxparser.so.0
#2  0x0000007fb19b09b0 in nvinfer1::builder::createNode(nvinfer1::ILayer&, nvinfer1::DeviceType) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#3  0x0000007fb19b6788 in nvinfer1::builder::buildGraph(nvinfer1::CudaEngineBuildConfig const&, nvinfer1::builder::Graph&, nvinfer1::Network const&) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#4  0x0000007fb19b79c0 in nvinfer1::builder::buildEngine(nvinfer1::CudaEngineBuildConfig&, nvinfer1::rt::HardwareContext const&, nvinfer1::Network const&) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#5  0x0000007fb19ce7cc in nvinfer1::builder::Builder::buildCudaEngine(nvinfer1::INetworkDefinition&) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
...

Through gdb sleuthing, it appears to crash while creating a plugin layer, most likely a LeakyReLU. Beyond that, I’m not sure how to debug this further. Any ideas? Are plugin layers supported with ONNX models in TensorRT?

Environment details:
Platform: Xavier
TensorRT: 5.0.0.8
PyTorch: 1.0.0

Thanks!

Hello,

Yes, plugin layers are supported with ONNX models. Have you tried this on a Linux host (non-Xavier)? Also, to help us debug, please share a small repro containing the ONNX model and conversion code that demonstrates the segfault you are seeing.

regards,
NVIDIA Enterprise Support

I have not yet tried on a non-Xavier host. I’ve sent a model reproducing the issue via DM.

Thanks

Hello,

Per engineering:

We do not see the model segfaulting with the TRT 5.x build. Tested on a Xavier DDPX board running D5L. Here is the output:

./build/aarch64-linux/trtexec --onnx=20190116_problematic_convtrans_frozen.onnx
&&&& RUNNING TensorRT.trtexec # ./build/aarch64-linux/trtexec --onnx=20190116_problematic_convtrans_frozen.onnx
[I] onnx: 20190116_problematic_convtrans_frozen.onnx

----------------------------------------------------------------
Input filename:   20190116_problematic_convtrans_frozen.onnx
ONNX IR version:  0.0.3
Opset version:    9
Producer name:    pytorch
Producer version: 0.4
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
[I] [TRT] 17:Conv -> (64, 256, 512)
[I] [TRT] 18:BatchNormalization -> (64, 256, 512)
[I] [TRT] 19:ConvTranspose -> (64, 512, 1024)
[I] [TRT] 20:BatchNormalization -> (64, 512, 1024)
[I] [TRT] 21:LeakyRelu -> (64, 512, 1024)
[I] [TRT] output1:Conv -> (3, 512, 1024)
 ----- Parsing of ONNX model 20190116_problematic_convtrans_frozen.onnx is Done ---- 
...
[I] Average over 10 runs is 185.69 ms (host walltime is 185.795 ms, 99% percentile time is 185.756).
&&&& PASSED TensorRT.trtexec # ./build/aarch64-linux/trtexec --onnx=20190116_problematic_convtrans_frozen.onnx

Please review your configuration.

Hi,

I updated from JetPack 4.0 (I believe) to 4.1.1 this week, and found that this segfault no longer occurs, which is great!

However, we are now seeing a new, nondeterministic crash when building TensorRT engines from our ONNX models. It occurs roughly 9 out of 10 times we try to build the engine, but eventually a build succeeds and we can run inference. It always happens during the “Fusing convolution weights” step, though not always at the same layer. Is it possible there is a race condition somewhere in this code?

Here’s the backtrace from this crash:

INFO: Fusing convolution weights from (Unnamed Layer* 35) [Convolution] with scale (Unnamed Layer* 36) [Scale]

Thread 1 "" received signal SIGSEGV, Segmentation fault.
__memcpy_generic () at ../sysdeps/aarch64/multiarch/../memcpy.S:170
170	../sysdeps/aarch64/multiarch/../memcpy.S: No such file or directory.
(gdb) bt
#0  __memcpy_generic () at ../sysdeps/aarch64/multiarch/../memcpy.S:170
#1  0x0000007fb05a30f0 in ?? () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#2  0x0000007fb059bf88 in nvinfer1::builder::fuseScale(nvinfer1::builder::Graph&, nvinfer1::CpuMemoryGroup&) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#3  0x0000007fb059c48c in nvinfer1::builder::applyGenericOptimizations(nvinfer1::builder::Graph&, nvinfer1::CpuMemoryGroup&, nvinfer1::CudaEngineBuildConfig const&) ()
   from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#4  0x0000007fb056442c in nvinfer1::builder::buildEngine(nvinfer1::CudaEngineBuildConfig&, nvinfer1::rt::HardwareContext const&, nvinfer1::Network const&) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#5  0x0000007fb05cf2ec in nvinfer1::builder::Builder::buildCudaEngine(nvinfer1::INetworkDefinition&) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#6  0x000000555555ffb0 in createCudaEngine(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, nvinfer1::DataType) ()
#7  0x000000555556093c in getCudaEngine(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, nvinfer1::DataType) ()
#8  0x000000555555eeb4 in main ()

The model I sent earlier doesn’t reproduce this issue, probably because it doesn’t have enough conv layers. I can try to recreate it with a different model if needed.

Thanks!

Can you share the new model and build code? Are you building on a Linux host?

This is on Xavier. I’ll see about creating a model that reproduces this issue and share it with you soon.

Ah, disregard! This turned out to be a lifetime issue with the parser object. The parser was being destructed after parser::parseFromFile() but before builder::buildCudaEngine(), which is presumably what caused the crash.
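
For anyone who hits the same thing, the fix was simply to keep the parser alive until after the engine is built. A rough sketch of the corrected flow with the TensorRT 5 C++ API is below; this is illustrative only (function and variable names are made up, not our actual code), and the explanation in the comments is my best guess at why it matters.

#include <string>

#include <NvInfer.h>
#include <NvOnnxParser.h>

// Sketch only: the key point is that the IParser must outlive buildCudaEngine(),
// presumably because the parsed network still references memory owned by the parser.
nvinfer1::ICudaEngine* createEngineFromOnnx(const std::string& onnxPath,
                                            nvinfer1::ILogger& logger)
{
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
    nvinfer1::INetworkDefinition* network = builder->createNetwork();

    // Previously the parser lived in a narrower scope and was destroyed right
    // after parseFromFile(); building the engine afterwards then crashed in
    // memcpy during "Fusing convolution weights".
    nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, logger);
    if (!parser->parseFromFile(onnxPath.c_str(),
                               static_cast<int>(nvinfer1::ILogger::Severity::kWARNING)))
    {
        // Report/handle parse errors, then clean up.
        parser->destroy();
        network->destroy();
        builder->destroy();
        return nullptr;
    }

    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(1 << 28);

    // Build the engine while the parser is still alive.
    nvinfer1::ICudaEngine* engine = builder->buildCudaEngine(*network);

    // Only tear down the parser (and the rest) after the build has finished.
    parser->destroy();
    network->destroy();
    builder->destroy();
    return engine;
}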