Two bugs when converting an ONNX ReID model to TensorRT

bindog · December 18, 2019, 10:35am

Hi, while trying to convert a ReID (person re-identification) model to TensorRT, I ran into two bugs in the onnx-tensorrt tool (GitHub - onnx/onnx-tensorrt: ONNX-TensorRT: TensorRT backend for ONNX). I did some debugging with gdb, but because TensorRT is not open source I can't dig any further, so I hope to get some help here.

The first bug is related to the Split operation, and it appears after this commit (https://github.com/onnx/onnx-tensorrt/commit/2066f534f66320b7ecdf3eccbaf18ff1fdba6287). According to the commit message, it adds support for dynamic split, but I found that it is not compatible with the static split in our model: after this line (https://github.com/onnx/onnx-tensorrt/blob/2066f534f66320b7ecdf3eccbaf18ff1fdba6287/builtin_op_importers.cpp#L1888), the output shape turns out to be (-1, -1, -1, -1), whereas in the version before this commit the output shape was something like (32, 64, 32, 32).
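To make the expected behaviour concrete, here is a minimal Python sketch of how the output shapes of a static split can be worked out from the input shape alone. It is only an illustration, not the onnx-tensorrt implementation: the function name `static_split_shapes` and the input shape (32, 128, 32, 32) are hypothetical, and I am assuming that when no split sizes are given the axis is divided into equal parts.

```python
from typing import List, Optional, Sequence, Tuple


def static_split_shapes(
    input_shape: Sequence[int],
    axis: int,
    split: Optional[Sequence[int]] = None,
    num_outputs: Optional[int] = None,
) -> List[Tuple[int, ...]]:
    """Return the output shapes of a split whose sizes are known up front.

    Hypothetical helper for illustration only; it is not part of
    onnx-tensorrt or TensorRT.
    """
    rank = len(input_shape)
    if not -rank <= axis < rank:
        raise ValueError(f"axis {axis} is out of range for rank {rank}")
    axis %= rank  # normalise a negative axis
    dim = input_shape[axis]

    if split is None:
        # Assumption: without explicit sizes, split into equal parts.
        if not num_outputs or dim % num_outputs != 0:
            raise ValueError("cannot split the axis into equal parts")
        split = [dim // num_outputs] * num_outputs
    elif sum(split) != dim:
        raise ValueError("split sizes must sum to the axis length")

    shapes = []
    for size in split:
        shape = list(input_shape)
        shape[axis] = size  # only the split axis changes
        shapes.append(tuple(shape))
    return shapes


if __name__ == "__main__":
    # Hypothetical input: batch 32, 128 channels, 32x32 feature map.
    print(static_split_shapes((32, 128, 32, 32), axis=1, split=[64, 64]))
```

With the made-up input above, every output dimension is a concrete number, such as (32, 64, 32, 32), which is what I would expect the importer to report for a static split instead of (-1, -1, -1, -1).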

The second bug is related to Instance Normalization (I guess). It crashes at this line of code (https://github.com/onnx/onnx-tensorrt/blob/8716c9b32dcc947287f2ede9ef7d563601bb2ee0/main.cpp#L245). From the gdb backtrace, the error seems to happen while TensorRT is trying to save something like PluginV2Param into a cache, and as far as I know, the only op in our model that needs a TensorRT plugin is Instance Normalization.

Here is the model download URL (Dropbox - File Deleted) for testing.

The system and CUDA versions are as follows:

GPU: RTX 2070
System: CentOS 7
CUDA: 10.0
CUDNN: 7.6.3
TensorRT: 6.0.1.5

bindog · December 19, 2019, 6:31am

Since TensorRT 7 has been released, I tried the newer version, but the second bug still remains… Here is the detailed gdb backtrace:

(gdb) bt
#0  0x00007fffe7c8c3e5 in __strlen_sse2_pminub () from /usr/lib64/libc.so.6
#1  0x00007fffe84d14c1 in length (__s=0x31 <Address 0x31 out of bounds>)
    at /home/xxxx/gcc-5.3.0/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/char_traits.h:267
#2  std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string (this=0x7fffffffc840,
    __s=0x31 <Address 0x31 out of bounds>, __a=...)
    at /home/xxxx/gcc-5.3.0/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:658
#3  0x00007fffe98157cb in nvinfer1::rt::ArchiveWriteUtils::save(nvinfer1::rt::WriteArchive&, nvinfer1::PluginV2Parameters const&) ()
   from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#4  0x00007fffe9815b35 in nvinfer1::rt::ArchiveWriteUtils::save(nvinfer1::rt::WriteArchive&, nvinfer1::OptionalValue<nvinfer1::rt::cuda::PluginV2DynamicExtRunner> const&) () from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#5  0x00007fffe981b06a in nvinfer1::rt::ArchiveWriteUtils::save(nvinfer1::rt::WriteArchive&, nvinfer1::OptionalValue<nvinfer1::rt::Runner> const&) () from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#6  0x00007fffe99fe44d in nvinfer1::cudnn::serializeEngine(nvinfer1::rt::Engine const&) ()
   from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#7  0x00007fffe97fcd3f in nvinfer1::rt::Engine::serialize() const ()
   from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#8  0x00000000004086d9 in main (argc=<optimized out>, argv=<optimized out>) at /world/data-gpu-94/tensorrt/onnx-tensorrt/main.cpp:245