Crash in deserializeCudaEngine

Hi! When I serialize a particular network and then deserialize it, the process crashes somewhere inside nvinfer.dll, four levels deep, with no usable call stack. It happens with one particular network only, not with all of them; that network is huge (~2.1 GB) and contains custom layers (which are implemented).
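
For context, this is the standard serialize/deserialize round trip; a rough sketch of what the code does (TensorRT 5 C++ API; gLogger and pluginFactory are placeholders for our own logger and the plugin factory that recreates the custom layers):

#include <NvInfer.h>

// Sketch only: the engine is built elsewhere; gLogger / pluginFactory are our own.
nvinfer1::ICudaEngine* roundTrip(nvinfer1::ICudaEngine* engine,
                                 nvinfer1::ILogger& gLogger,
                                 nvinfer1::IPluginFactory* pluginFactory)
{
    // Serialize the engine to a host-memory blob (~2.1 GB for this network).
    nvinfer1::IHostMemory* blob = engine->serialize();

    // Deserializing that same blob is where the crash happens.
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    nvinfer1::ICudaEngine* reloaded =
        runtime->deserializeCudaEngine(blob->data(), blob->size(), pluginFactory);

    blob->destroy();
    return reloaded;
}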

On Windows there is no further information, but on Ubuntu I get the following traces:
trt: runtime.cpp (24) - Cuda Error in allocate: 2
trt: cuda/cudaFusedConvActLayer.cpp (287) - Cuda Error in executeFused: 2
trt: cuda/cudaFusedConvActLayer.cpp (287) - Cuda Error in executeFused: 2
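
For what it's worth, CUDA error 2 appears to be cudaErrorMemoryAllocation (out of memory); a quick way to confirm the mapping with the CUDA runtime API:

#include <cuda_runtime_api.h>
#include <cstdio>

int main()
{
    // Translate the numeric code from the TensorRT log into readable text.
    cudaError_t err = static_cast<cudaError_t>(2);
    std::printf("%s: %s\n", cudaGetErrorName(err), cudaGetErrorString(err));
    // Prints: cudaErrorMemoryAllocation: out of memory
    return 0;
}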

Could you please give advice on what might be happening? Thanks!

Reproduced in the following environments:
OS: Windows 10 x64, TensorRT 5.1.5.0 (for CUDA 10.1), CUDA 10.1.105, MSVC 2017, GPU GTX 1060.
OS: Ubuntu 16.04

It seems that the size in bytes (~2.1 GB) is too big for a 32-bit integer that is used somewhere inside TensorRT.
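
A sanity check that supports this: look at the blob size right after serialization and compare it against the largest value a signed 32-bit integer can hold (minimal sketch, assuming the engine has already been built):

#include <NvInfer.h>
#include <cstdint>
#include <cstdio>

void checkSerializedSize(nvinfer1::ICudaEngine& engine)
{
    nvinfer1::IHostMemory* blob = engine.serialize();
    // INT32_MAX bytes (0x7FFFFFFF, just under 2 GiB) is the most a signed
    // 32-bit offset can address; our blob is ~2.1 GB, i.e. above this limit.
    if (blob->size() > static_cast<std::size_t>(INT32_MAX))
    {
        std::printf("Serialized engine is %zu bytes (> %d): likely to overflow "
                    "a 32-bit offset inside TensorRT\n",
                    blob->size(), INT32_MAX);
    }
    blob->destroy();
}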

Here’s more detailed information from Ubuntu 16.04:
assert: /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/externals/flatbuffers-1.1.0/include/flatbuffers/flatbuffers.h:413: uint8_t* nvinferFlatBuffers::flatbuffers::vector_downward::make_space(size_t): Assertion `size() < (1UL << (sizeof(soffset_t) * 8 - 1)) - 1' failed.

backtrace:
#3 0x00007fffd5d91c82 in __GI___assert_fail (assertion=0x7fffd799b320 "size() < (1UL << (sizeof(soffset_t) * 8 - 1)) - 1", file=0x7fffd7999a78 "/home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/externals/flatbuffers-1.1.0/include/flatbuffers/flatbuffers.h", line=0x19d, function=0x7fffd799ea80 "uint8_t* nvinferFlatBuffers::flatbuffers::vector_downward::make_space(size_t)") at assert.c:101
#4 0x00007fffd72c278c in ?? () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.5
#5 0x00007fffd72b999f in nvinfer1::rt::serializeGlob(CUstream_st*, nvinferFlatBuffers::flatbuffers::FlatBufferBuilder&, nvinfer1::GpuMemory const&) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.5
#6 0x00007fffd73bfcb5 in nvinfer1::cudnn::serializeEngine(nvinfer1::rt::Engine const&) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.5

size() == 0x7fffffff
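
That is exactly the limit the assertion checks against: flatbuffers' soffset_t is a signed 32-bit type, so the condition reduces to size() < 0x7FFFFFFF, and the serialized data hits it. Spelled out (just evaluating the expression from the assert, with soffset_t assumed to be int32_t):

#include <cstdint>
#include <cstdio>

int main()
{
    using soffset_t = std::int32_t;  // flatbuffers' signed offset type
    // The failing check: size() < (1UL << (sizeof(soffset_t) * 8 - 1)) - 1
    unsigned long limit = (1UL << (sizeof(soffset_t) * 8 - 1)) - 1;
    std::printf("limit = 0x%lx\n", limit);  // prints limit = 0x7fffffff
    // size() reached 0x7fffffff, so any engine that serializes to ~2 GiB or
    // more would trip this assert during serializeGlob().
    return 0;
}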

Hi, can you provide the following details on the platforms you are using?

  1. Linux distro and version
    Ubuntu 16.04
  2. GPU type
    GTX 1060
  3. Nvidia driver version
  4. CUDA version
  5. CUDNN version
  6. Python version [if using python]
  7. TensorFlow version
  8. TensorRT version
  9. Any source files and models you can provide will help us reproduce your issue and further debug it. You can private message these if you don’t want them to be public.

Thanks,
NVIDIA Enterprise Support