TensorRT-7.1.3.4 Deserialize the cuda engine failed

Description

$ ./main image
ERROR: /home/jenkins/workspace/TensorRT/helpers/rel-7.1/L1_Nightly_Internal/build/source/rtSafe/resources.h (460) - Cuda Error in loadKernel: -1 (TensorRT internal error)
ERROR: INVALID_STATE: std::exception
ERROR: INVALID_CONFIG: Deserialize the cuda engine failed.
Segmentation fault

The ONNX file works with CUDA 10.0 + cuDNN 7.5 + TensorRT-5.1.5.0,
and the PyTorch-to-ONNX export reports no errors or warnings.

value of engine->getNbLayers()
in TensorRT-5.1.5.0 is 40
in TensorRT-7.1.3.4 is 35

Environment

TensorRT Version: 7.1.3.4
GPU Type: Tesla T4
Nvidia Driver Version: 450.51.05
CUDA Version: 11.0
CUDNN Version: cudnn-11.0-linux-x64-v8.0.2.39
Operating System + Version: Ubuntu 16.04
Python Version (if applicable): 3.7
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.6.0
Baremetal or Container (if container which image + tag):

Relevant Files

Steps To Reproduce

logs:

ubuntu @ ~/tools/TensorRT-7.1.3.4/bin
$ ./trtexec --onnx=/tmp/test.onnx --shapes=input:32x3x160x96 --explicitBatch --workspace=1024 --fp16 --saveEngine=/tmp/test.engine
&&&& RUNNING TensorRT.trtexec # ./trtexec --onnx=/tmp/test.onnx --shapes=input:32x3x160x96 --explicitBatch --workspace=1024 --fp16 --saveEngine=/tmp/test.engine
[10/14/2020-10:21:48] [I] === Model Options ===
[10/14/2020-10:21:48] [I] Format: ONNX
[10/14/2020-10:21:48] [I] Model: /tmp/test.onnx
[10/14/2020-10:21:48] [I] Output:
[10/14/2020-10:21:48] [I] === Build Options ===
[10/14/2020-10:21:48] [I] Max batch: explicit
[10/14/2020-10:21:48] [I] Workspace: 1024 MB
[10/14/2020-10:21:48] [I] minTiming: 1
[10/14/2020-10:21:48] [I] avgTiming: 8
[10/14/2020-10:21:48] [I] Precision: FP32+FP16
[10/14/2020-10:21:48] [I] Calibration:
[10/14/2020-10:21:48] [I] Safe mode: Disabled
[10/14/2020-10:21:48] [I] Save engine: /tmp/test.engine
[10/14/2020-10:21:48] [I] Load engine:
[10/14/2020-10:21:48] [I] Builder Cache: Enabled
[10/14/2020-10:21:48] [I] NVTX verbosity: 0
[10/14/2020-10:21:48] [I] Inputs format: fp32:CHW
[10/14/2020-10:21:48] [I] Outputs format: fp32:CHW
[10/14/2020-10:21:48] [I] Input build shape: input=32x3x160x96+32x3x160x96+32x3x160x96
[10/14/2020-10:21:48] [I] Input calibration shapes: model
[10/14/2020-10:21:48] [I] === System Options ===
[10/14/2020-10:21:48] [I] Device: 0
[10/14/2020-10:21:48] [I] DLACore:
[10/14/2020-10:21:48] [I] Plugins:
[10/14/2020-10:21:48] [I] === Inference Options ===
[10/14/2020-10:21:48] [I] Batch: Explicit
[10/14/2020-10:21:48] [I] Input inference shape: input=32x3x160x96
[10/14/2020-10:21:48] [I] Iterations: 10
[10/14/2020-10:21:48] [I] Duration: 3s (+ 200ms warm up)
[10/14/2020-10:21:48] [I] Sleep time: 0ms
[10/14/2020-10:21:48] [I] Streams: 1
[10/14/2020-10:21:48] [I] ExposeDMA: Disabled
[10/14/2020-10:21:48] [I] Spin-wait: Disabled
[10/14/2020-10:21:48] [I] Multithreading: Disabled
[10/14/2020-10:21:48] [I] CUDA Graph: Disabled
[10/14/2020-10:21:48] [I] Skip inference: Disabled
[10/14/2020-10:21:48] [I] Inputs:
[10/14/2020-10:21:48] [I] === Reporting Options ===
[10/14/2020-10:21:48] [I] Verbose: Disabled
[10/14/2020-10:21:48] [I] Averages: 10 inferences
[10/14/2020-10:21:48] [I] Percentile: 99
[10/14/2020-10:21:48] [I] Dump output: Disabled
[10/14/2020-10:21:48] [I] Profile: Disabled
[10/14/2020-10:21:48] [I] Export timing to JSON file:
[10/14/2020-10:21:48] [I] Export output to JSON file:
[10/14/2020-10:21:48] [I] Export profile to JSON file:
[10/14/2020-10:21:48] [I]

Input filename: /tmp/test.onnx
ONNX IR version: 0.0.4
Opset version: 9
Producer name: pytorch
Producer version: 1.6
Domain:
Model version: 0
Doc string:

[10/14/2020-10:23:31] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[10/14/2020-10:23:31] [I] Starting inference threads
[10/14/2020-10:23:34] [I] Warmup completed 0 queries over 200 ms
[10/14/2020-10:23:34] [I] Timing trace has 0 queries over 3.00063 s
[10/14/2020-10:23:34] [I] Trace averages of 10 runs:
[10/14/2020-10:23:34] [I] Average on 10 runs - GPU latency: 0.24994 ms - Host latency: 0.416225 ms (end to end 0.478523 ms, enqueue 0.0757492 ms)

0.3979 ms, enqueue 0.0746338 ms)
[10/14/2020-10:23:34] [I] Average on 10 runs - GPU latency: 0.209033 ms - Host latency: 0.37102 ms (end to end 0.403516 ms, enqueue 0.0744385 ms)
[10/14/2020-10:23:34] [I] Host Latency
[10/14/2020-10:23:34] [I] min: 0.357666 ms (end to end 0.368225 ms)
[10/14/2020-10:23:34] [I] max: 0.459991 ms (end to end 0.510849 ms)
[10/14/2020-10:23:34] [I] mean: 0.371668 ms (end to end 0.401351 ms)
[10/14/2020-10:23:34] [I] median: 0.369873 ms (end to end 0.401855 ms)
[10/14/2020-10:23:34] [I] percentile: 0.408752 ms at 99% (end to end 0.467224 ms at 99%)
[10/14/2020-10:23:34] [I] throughput: 0 qps
[10/14/2020-10:23:34] [I] walltime: 3.00063 s
[10/14/2020-10:23:34] [I] Enqueue Time
[10/14/2020-10:23:34] [I] min: 0.0690918 ms
[10/14/2020-10:23:34] [I] max: 0.124512 ms
[10/14/2020-10:23:34] [I] median: 0.0737915 ms
[10/14/2020-10:23:34] [I] GPU Compute
[10/14/2020-10:23:34] [I] min: 0.196289 ms
[10/14/2020-10:23:34] [I] max: 0.274429 ms
[10/14/2020-10:23:34] [I] mean: 0.208811 ms
[10/14/2020-10:23:34] [I] median: 0.207275 ms
[10/14/2020-10:23:34] [I] percentile: 0.246078 ms at 99%
[10/14/2020-10:23:34] [I] total compute time: 2.94236 s
&&&& PASSED TensorRT.trtexec # ./trtexec --onnx=/tmp/test.onnx --shapes=input:32x3x160x96 --explicitBatch --workspace=1024 --fp16 --saveEngine=/tmp/test.engine
ubuntu@ ~/tools/TensorRT-7.1.3.4/bin

{
    // Report the network I/O shapes and the size of the serialized engine.
    int nbInput = network->getNbInputs();
    auto inDim = network->getInput(0)->getDimensions();
    int nbOutput = network->getNbOutputs();
    auto outDim = network->getOutput(0)->getDimensions();
    // volume() returns a 64-bit count and size() a size_t, so neither is printed with %d.
    printf("%s\n inputs %d, inputDims %d, inputCount %lld\n"
           " outputs %d, outputDims %d, outputCount %lld, engine_size %zu, engine_nbLayers %d\n",
           output_file.c_str(),
           nbInput, inDim.nbDims, static_cast<long long>(samplesCommon::volume(inDim)),
           nbOutput, outDim.nbDims, static_cast<long long>(samplesCommon::volume(outDim)),
           static_cast<size_t>(data->size()), engine->getNbLayers());
}
inputs 1, inputDims 4, inputCount 491520
outputs 1, outputDims 2, outputCount 96, engine_size 316916, engine_nbLayers 35

ubuntu @ ~/work
$ ./main image
ERROR: /home/jenkins/workspace/TensorRT/helpers/rel-7.1/L1_Nightly_Internal/build/source/rtSafe/resources.h (460) - Cuda Error in loadKernel: -1 (TensorRT internal error)
ERROR: INVALID_STATE: std::exception
ERROR: INVALID_CONFIG: Deserialize the cuda engine failed.
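
The segmentation fault reported in the description, right after the deserialization error, would be consistent with the application using the engine pointer without checking it, although the post does not show that code. A minimal C++ sketch of defensive loading follows. The helper name `readEngineFile` is hypothetical (not from the post), and the TensorRT call is shown only as a comment because the surrounding application code is unknown.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): read a serialized engine
// ("plan") file into memory, failing loudly if it is missing or empty.
std::vector<char> readEngineFile(const std::string& path)
{
    // Open at the end of the file so tellg() reports its size.
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file)
    {
        throw std::runtime_error("cannot open engine file: " + path);
    }
    const std::streamsize size = file.tellg();
    if (size <= 0)
    {
        throw std::runtime_error("engine file is empty: " + path);
    }
    std::vector<char> buffer(static_cast<size_t>(size));
    file.seekg(0, std::ios::beg);
    if (!file.read(buffer.data(), size))
    {
        throw std::runtime_error("short read on engine file: " + path);
    }
    return buffer;
}

// In the application, the buffer would then be handed to the TensorRT runtime
// and the result checked before any use, for example:
//   auto* engine = runtime->deserializeCudaEngine(buffer.data(), buffer.size(), nullptr);
//   if (engine == nullptr) { /* report the failure and return; do not dereference */ }
```

With a check like this, a failed deserialization would end in an error message rather than a crash. It would not fix the underlying `loadKernel` error.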

Hi @641263629,
Could you please share your ONNX model so that we can assist you better?

Thanks!

Hi, is there any problem with the ONNX file?

Hi @641263629,
I could not reproduce the issue.
Are you using the same TensorRT version to deserialize the engine as the one you used to build it?
Thanks!

Yes, there is only one version of TensorRT installed on my system. Does the GPU type matter, e.g. T4 vs. 2080 Ti?

Yes it does.
The generated plan files are not portable across platforms or TensorRT versions. Plans are specific to the exact GPU model they were built on (in addition to the platforms and the TensorRT version) and must be re-targeted to the specific GPU in case you want to run them on a different GPU.
Thanks!
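
One way to avoid loading a plan on the wrong setup is to put the TensorRT version and the GPU name into the engine file name, so each combination gets its own cached file. The sketch below is only an illustration: `engineCacheName` and its parameters are hypothetical. In a real application the GPU name could come from `cudaGetDeviceProperties` (the `name` field of `cudaDeviceProp`) and the version from the `NV_TENSORRT_MAJOR`/`MINOR`/`PATCH` macros, but here both are passed in as plain strings.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: derive a cache file name that changes whenever the
// TensorRT version or the GPU changes, so a plan built for one setup is not
// silently reused on another.
std::string engineCacheName(const std::string& modelName,
                            const std::string& trtVersion,
                            const std::string& gpuName)
{
    std::string gpuTag;
    for (char c : gpuName)
    {
        // Keep letters and digits; map everything else (spaces, dashes) to '_'.
        gpuTag += std::isalnum(static_cast<unsigned char>(c)) ? c : '_';
    }
    return modelName + "_trt" + trtVersion + "_" + gpuTag + ".engine";
}
```

For example, `engineCacheName("test", "7.1.3.4", "Tesla T4")` yields `test_trt7.1.3.4_Tesla_T4.engine`, while the same model on a "GeForce RTX 2080 Ti" maps to a different file and would be rebuilt there.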

I found the cause!
I need both TensorRT and libtorch. Building against the latest PyTorch requires std=c++14, but with that setting TensorRT fails with the error above. After I dropped libtorch and built my code with std=c++11, it ran successfully.
But how can I resolve this conflict?