Crash when converting an ONNX ReID model to TensorRT

Hi, while converting a ReID (person re-identification) model to TensorRT, I hit a crash in the onnx-tensorrt tool (https://github.com/onnx/onnx-tensorrt). I did some debugging with gdb, but because TensorRT is not open source I can’t dig any further, so I hope to get some help here.

The crash seems related to Instance Normalization (my guess). It happens at this line of code: https://github.com/onnx/onnx-tensorrt/blob/8716c9b32dcc947287f2ede9ef7d563601bb2ee0/main.cpp#L245 . Judging from the gdb backtrace, the error occurs while TensorRT tries to save something like PluginV2Parameters into an archive, and as far as I know the only op in our model that needs a TensorRT plugin is Instance Normalization.

I tried it on TensorRT 6 and the newly released TensorRT 7; both have the same issue.

Here are the gdb backtrace details:

./onnx2trt resnet_50_ibn_a_op10.onnx -o test.engine

(gdb) bt
#0  0x00007fffe7c8c3e5 in __strlen_sse2_pminub () from /usr/lib64/libc.so.6
#1  0x00007fffe84d14c1 in length (__s=0x31 <Address 0x31 out of bounds>)
    at /home/xxxx/gcc-5.3.0/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/char_traits.h:267
#2  std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string (this=0x7fffffffc840,
    __s=0x31 <Address 0x31 out of bounds>, __a=...)
    at /home/xxxx/gcc-5.3.0/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:658
#3  0x00007fffe98157cb in nvinfer1::rt::ArchiveWriteUtils::save(nvinfer1::rt::WriteArchive&, nvinfer1::PluginV2Parameters const&) ()
   from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#4  0x00007fffe9815b35 in nvinfer1::rt::ArchiveWriteUtils::save(nvinfer1::rt::WriteArchive&, nvinfer1::OptionalValue<nvinfer1::rt::cuda::PluginV2DynamicExtRunner> const&) () from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#5  0x00007fffe981b06a in nvinfer1::rt::ArchiveWriteUtils::save(nvinfer1::rt::WriteArchive&, nvinfer1::OptionalValue<nvinfer1::rt::Runner> const&) () from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#6  0x00007fffe99fe44d in nvinfer1::cudnn::serializeEngine(nvinfer1::rt::Engine const&) ()
   from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#7  0x00007fffe97fcd3f in nvinfer1::rt::Engine::serialize() const ()
   from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#8  0x00000000004086d9 in main (argc=<optimized out>, argv=<optimized out>) at /world/data-gpu-94/tensorrt/onnx-tensorrt/main.cpp:245

You can download our demo model (https://www.dropbox.com/s/g2rxzx45vfb7q9h/resnet_50_ibn_a_op10.onnx?dl=0) for testing.

Hi,

Could you please try using the “trtexec” command to convert the model?
The “--verbose” flag will help you debug the issue.
“trtexec” is useful for benchmarking networks and makes it faster and easier to debug the issue.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

I just tried the trtexec command on the “tensorrt:19.11-py3” NGC image (TRT 6) and was able to successfully convert the model:

trtexec --onnx=resnet_50_ibn_a_op10.onnx --verbose --saveEngine=model_test.trt

https://docs.nvidia.com/deeplearning/sdk/tensorrt-container-release-notes/rel_19-11.html

Thanks

Thanks for your reply!

I tried the original TensorRT 6.0.1.5 (the same version as in the Docker image), without compiling TensorRT from GitHub (which would replace trtexec and some other libraries), and it works fine, thank you!

I also found that the output of the converted TensorRT engine differs a lot from the original ONNX model; I will keep debugging that and open another topic.

But the same error still happens in the newly released TensorRT 7, so I think you still need to look into this issue. Here is the gdb backtrace when I use the trtexec from TensorRT 7 (I did not compile TensorRT from GitHub, just the original trtexec):

(gdb) bt
#0  0x00007fffca0963c1 in __strlen_sse2_pminub () from /usr/lib64/libc.so.6
#1  0x00007fffca8db4c1 in length (__s=0x100000000 <Address 0x100000000 out of bounds>)
    at /home/xxxx/gcc-5.3.0/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/char_traits.h:267
#2  std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string (this=0x7fffffffc050,
    __s=0x100000000 <Address 0x100000000 out of bounds>, __a=...)
    at /home/xxxx/gcc-5.3.0/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:658
#3  0x00007fffea4a57cb in nvinfer1::rt::ArchiveWriteUtils::save(nvinfer1::rt::WriteArchive&, nvinfer1::PluginV2Parameters const&) ()
   from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#4  0x00007fffea4a5b35 in nvinfer1::rt::ArchiveWriteUtils::save(nvinfer1::rt::WriteArchive&, nvinfer1::OptionalValue<nvinfer1::rt::cuda::PluginV2DynamicExtRunner> const&) () from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#5  0x00007fffea4ab06a in nvinfer1::rt::ArchiveWriteUtils::save(nvinfer1::rt::WriteArchive&, nvinfer1::OptionalValue<nvinfer1::rt::Runner> const&) () from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#6  0x00007fffea68e44d in nvinfer1::cudnn::serializeEngine(nvinfer1::rt::Engine const&) ()
   from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#7  0x00007fffea48cd3f in nvinfer1::rt::Engine::serialize() const ()
   from /world/data-gpu-94/tensorrt/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#8  0x0000000000425529 in sample::saveEngine(nvinfer1::ICudaEngine const&, std::string const&, std::ostream&) ()
#9  0x0000000000427837 in sample::getEngine(sample::ModelOptions const&, sample::BuildOptions const&, sample::SystemOptions const&, std::ostream&) ()
#10 0x0000000000404ec3 in main ()

Hi,

In TRT 7, the ONNX parser supports full-dimensions mode only. Your network definition must be created with the explicitBatch flag set (when using the ONNX parser).
Please refer to the link below for more details:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-release-notes/tensorrt-7.html

Related: https://github.com/NVIDIA/TensorRT/issues/283

Thanks

I’m afraid this error has nothing to do with explicitBatch; I’ve tried the explicitBatch flag and it still crashes. Actually, I did a lot of experiments and now I’m pretty sure this error is caused by the Instance Normalization plugin (as I said before).

If I remove the Instance Normalization layer from the model, everything works fine. I have uploaded the models with and without Instance Normalization for you to test.

model with Instance Normalization (https://www.dropbox.com/s/048l38k7grlffnf/resnet_50_ibn_a_with_in.onnx?dl=0)
model without Instance Normalization (https://www.dropbox.com/s/xx9t42eo3pafspp/resnet_50_without_in.onnx?dl=0)

Now the situation is:

TRT 6, model with IN, normal
TRT 6, model without IN, normal
TRT 7, model with IN, crash
TRT 7, model without IN, normal

I guess there may be something inconsistent between TRT 6 and 7, maybe the name of the plugin or something else…

Here are the logs from when trtexec in TRT 6 and TRT 7 registers the plugins.

TRT 6

[11/20/2019-18:05:10] [V] [TRT] Plugin Creator registration succeeded - GridAnchor_TRT
[11/20/2019-18:05:10] [V] [TRT] Plugin Creator registration succeeded - NMS_TRT
[11/20/2019-18:05:10] [V] [TRT] Plugin Creator registration succeeded - Reorg_TRT
[11/20/2019-18:05:10] [V] [TRT] Plugin Creator registration succeeded - Region_TRT
[11/20/2019-18:05:10] [V] [TRT] Plugin Creator registration succeeded - Clip_TRT
[11/20/2019-18:05:10] [V] [TRT] Plugin Creator registration succeeded - LReLU_TRT
[11/20/2019-18:05:10] [V] [TRT] Plugin Creator registration succeeded - PriorBox_TRT
[11/20/2019-18:05:10] [V] [TRT] Plugin Creator registration succeeded - Normalize_TRT
[11/20/2019-18:05:10] [V] [TRT] Plugin Creator registration succeeded - RPROI_TRT
[11/20/2019-18:05:10] [V] [TRT] Plugin Creator registration succeeded - BatchedNMS_TRT
[11/20/2019-18:05:10] [V] [TRT] Plugin Creator registration succeeded - FlattenConcat_TRT

TRT 7

[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::GridAnchor_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::NMS_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::Reorg_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::Region_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::Clip_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::LReLU_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::PriorBox_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::Normalize_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::RPROI_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::BatchedNMS_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::FlattenConcat_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::CropAndResize
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::DetectionLayer_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::Proposal
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::ProposalLayer_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::PyramidROIAlign_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::ResizeNearest_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::Split
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::SpecialSlice_TRT
[12/20/2019-17:58:00] [V] [TRT] Plugin creator registration succeeded - ::InstanceNormalization_TRT

The plugin names in TRT 7 look strange. Why is there a “::” before each plugin name? And why is the Instance Normalization plugin not in the TRT 6 log, even though TRT 6 can convert the model correctly?

Anyway, this is only my guess; I hope you can check the gdb backtrace and locate where the problem is…

Hi,
We tried both models on TRT 6 and TRT 7 using the command below:

trtexec --onnx=resnet_50_ibn_a_with_in.onnx  --verbose

And it seems to be working fine. (The current setup doesn’t use any OSS components.)

Could you please share the error log you are getting while running this model, along with platform information, so we can help better?

Thanks

Hi,

I think you missed an important argument needed to reproduce this error: --saveEngine=test_model.trt

I will show you how to reproduce this error. To avoid platform differences, I used the NGC image “19.12-py3” (https://docs.nvidia.com/deeplearning/sdk/tensorrt-container-release-notes/rel_19-12.html#rel_19-12). The command I use to start the Docker container is:

sudo docker run -it --rm --gpus device=0 --privileged --mount type=bind,source=/home/xxxx/tensorrtbk,target=/opt/tensorrtbk nvcr.io/nvidia/tensorrt:19.12-py3 /bin/bash

Then enter the Docker container, download TensorRT-7.0.0.11.Ubuntu-18.04.x86_64-gnu.cuda-10.2.cudnn7.6.tar.gz, and extract it to /opt/tensorrtbk. I then set the environment variables below:

export TRT_RELEASE=/opt/tensorrtbk/TensorRT-7.0.0.11
export TENSORRT_INCLUDE_DIR=$TRT_RELEASE/include
export TENSORRT_LIBRARY=$TRT_RELEASE/lib
export LD_LIBRARY_PATH=$TRT_RELEASE/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=$TRT_RELEASE/lib:$LIBRARY_PATH

Then execute the command under /opt/tensorrtbk/TensorRT-7.0.0.11/bin:

./trtexec --onnx=/opt/tensorrtbk/resnet_50_ibn_a_with_in.onnx --verbose --saveEngine=/opt/tensorrtbk/model_test.trt

You will see “Segmentation fault (core dumped)”. When I look at the gdb backtrace, the crash point is similar to the one in my host environment:

(gdb) bt
#0  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:93
#1  0x00007fffc5833b11 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007fffe931707d in nvinfer1::rt::ArchiveWriteUtils::save(nvinfer1::rt::WriteArchive&, nvinfer1::PluginV2Parameters const&) ()
   from /opt/tensorrtbk/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#3  0x00007fffe9317b82 in nvinfer1::rt::ArchiveWriteUtils::save(nvinfer1::rt::WriteArchive&, nvinfer1::OptionalValue<nvinfer1::rt::cuda::PluginV2DynamicExtRunner> const&) () from /opt/tensorrtbk/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#4  0x00007fffe931e195 in nvinfer1::rt::ArchiveWriteUtils::save(nvinfer1::rt::WriteArchive&, nvinfer1::OptionalValue<nvinfer1::rt::Runner> const&) () from /opt/tensorrtbk/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#5  0x00007fffe950bf58 in nvinfer1::cudnn::serializeEngine(nvinfer1::rt::Engine const&) ()
   from /opt/tensorrtbk/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#6  0x00007fffe92fe6df in nvinfer1::rt::Engine::serialize() const () from /opt/tensorrtbk/TensorRT-7.0.0.11/lib/libnvinfer.so.7
#7  0x000055555557fb3a in sample::saveEngine(nvinfer1::ICudaEngine const&, std::string const&, std::ostream&) ()
#8  0x000055555558213f in sample::getEngine(sample::ModelOptions const&, sample::BuildOptions const&, sample::SystemOptions const&, std::ostream&) ()
#9  0x0000555555559d53 in main ()

The GPU we used is a Tesla V100, and the driver version is 440.33.01.

Hi, is there any progress on this issue?

Are you still tracking this issue? I think Instance Normalization is a useful op, especially in areas like style transfer, so please help resolve this issue.

Hi,

Sorry for the late reply.
Yes, we are looking into this issue. Once we have any updates, we will share them in the forum.

Thanks

Hi,

The team has submitted a fix for this issue. You can get the fix through the OSS repo.
Please let us know in case of any further issues.

Thanks

I tried the fix in the OSS repo (https://github.com/NVIDIA/TensorRT/commit/090231a93ca6ed54f527f6851122460f221098fe), and I can now successfully convert the ONNX model to a TRT engine, but when I try to load the converted engine using the code below, it still throws an error:

with open(self.engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
    self.context = engine.create_execution_context()

Here is the error message:

[TensorRT] ERROR: INVALID_ARGUMENT: getPluginCreator could not find plugin InstanceNormalization_TRT version 001
[TensorRT] ERROR: safeDeserializationUtils.cpp (293) - Serialization Error in load: 0 (Cannot deserialize plugin since corresponding IPluginCreator not found in Plugin Registry)
[TensorRT] ERROR: INVALID_STATE: std::exception
[TensorRT] ERROR: INVALID_CONFIG: Deserialize the cuda engine failed.

Hi,

Could you please check if LD_LIBRARY_PATH is updated correctly to point to the latest TRT OSS build?

Thanks

Sorry for the late reply.

We have already solved the problem with a single line of code:

trt.init_libnvinfer_plugins(TRT_LOGGER, "")

But we found that there is a precision issue with Instance Normalization; we will open another post about that.