Error deserializing TRT engine when bundling with PyInstaller

System info:

cat /etc/nv_tegra_release
# R32 (release), REVISION: 5.1, GCID: 26202423, BOARD: t186ref, EABI: aarch64, DATE: Fri Feb 19 16:50:29 UTC 2021
sudo apt-cache show nvidia-jetpack | grep Version
Version: 4.5.1-b17
Version: 4.5-b129
  • CUDA version: 10.2
  • CUDNN version: 8.0.0
  • Python version: 3.6.9
  • Tensorflow version: 1.15.4
  • TensorRT version: 7.1.3.0

Here’s the relevant error message while trying to deserialize the TRT engine file during inference:

[TensorRT] ERROR: /home/jenkins/workspace/TensorRT/helpers/rel-7.1/L1_Nightly_Internal/build/source/rtSafe/resources.h (460) - Cuda Error in loadKernel: 3 (initialization error)
[TensorRT] ERROR: INVALID_STATE: std::exception
[TensorRT] ERROR: INVALID_CONFIG: Deserialize the cuda engine failed.
  ...
RuntimeError: Unable to load the engine file
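For context, the failing step is the standard TensorRT Python deserialization call. A minimal sketch, assuming TensorRT 7's Python API (the engine path is illustrative, and this can only run on the Jetson itself):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
# Register the built-in plugins, matching the "Registered plugin creator"
# lines in the verbose log
trt.init_libnvinfer_plugins(TRT_LOGGER, '')

with open('models/yolov4_crowdhuman.trt', 'rb') as f, \
        trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
if engine is None:
    # TensorRT logs the CUDA error shown above and returns None here
    raise RuntimeError('Unable to load the engine file')
```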

Here’s a Pastebin link to the full PyInstaller log and the error output during inference.

This is the spec file used to create the binary with PyInstaller.

FYI, I’m working with a modified version of this repo.

What does work:

  • building the TRT engine file and running inference without Pyinstaller

What does not work:

  • building the TRT engine file, adding it as data to the PyInstaller spec file, bundling, and then deserializing the engine file while running the binary

  • same as above, but adding excludes=['pycuda'] to the PyInstaller spec file as recommended by rahul_thai_valappil in this StackOverflow post. I did not “[copy] entire Pycuda folder from python packages” since I wasn’t sure how that should be done, but it seems not to have mattered, as I get the same error as above. Edit: this point is not relevant since pycuda is not installed

  • adding the .onnx file as data to the spec file instead of the .trt files, and having the bundled binary build the engine files before inference. I get a different error with this approach (see below)
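For reference, the data/excludes changes described above amount to something like the following fragment of a PyInstaller .spec file. This is a sketch with assumed entry-point and model paths, not the actual contents of fastMOT.spec:

```python
# Fragment of a hypothetical PyInstaller .spec file -- paths are assumptions
a = Analysis(
    ['app.py'],
    datas=[('models/yolov4_crowdhuman.trt', 'models')],  # bundle engine as data
    excludes=['pycuda'],  # workaround suggested in the StackOverflow post
)
```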

Error when having the binary build the TRT engine file:

RuntimeError: Driver error: 

This occurs during the tensorrt.Builder(...).build_engine(...) call.
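The in-binary build path looks roughly like the following sketch, assuming TensorRT 7's ONNX workflow (file names and workspace size are illustrative; this requires the Jetson's TensorRT install to run):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger()
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

with trt.Builder(TRT_LOGGER) as builder, \
        builder.create_network(EXPLICIT_BATCH) as network, \
        trt.OnnxParser(network, TRT_LOGGER) as parser, \
        builder.create_builder_config() as config:
    with open('models/yolov4_crowdhuman.onnx', 'rb') as f:
        if not parser.parse(f.read()):
            raise RuntimeError('failed to parse the ONNX file')
    config.max_workspace_size = 1 << 28  # 256 MiB scratch space
    # Inside the PyInstaller bundle this call fails immediately with
    # "RuntimeError: Driver error:", before any layer timing happens
    engine = builder.build_engine(network, config)
```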

Hi,

We cannot open the Pastebin links you shared for the log and the spec file.
Could you check them?

In general, a driver error points to an incompatibility between the OS and a library.
Could you share more about your environment? Did you set up the system with JetPack 4.5.1? Any manual installation?

Also, could you try to deserialize the engine file with trtexec first?
This can help us narrow down where the issue comes from.

Thanks.

Here’s the PyInstaller spec:
fastMOT.spec (32.7 KB)

PyInstaller bundle run:
Pyinstaller bundle.txt (36.1 KB)

PyInstaller binary execution:

./dist/fastMOT --mode extraction --input_uri ~/test_bundle/clips/2021-06-03_17-02-28_001F55121D18.mp4
[TensorRT] VERBOSE: Registered plugin creator - ::GridAnchor_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::NMS_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Reorg_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Region_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Clip_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::LReLU_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::PriorBox_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Normalize_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::RPROI_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::BatchedNMS_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::FlattenConcat_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::CropAndResize version 1
[TensorRT] VERBOSE: Registered plugin creator - ::DetectionLayer_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Proposal version 1
[TensorRT] VERBOSE: Registered plugin creator - ::ProposalLayer_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::PyramidROIAlign_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::ResizeNearest_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Split version 1
[TensorRT] VERBOSE: Registered plugin creator - ::SpecialSlice_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::InstanceNormalization_TRT version 1
2021-08-26 15:32:49 [    INFO] Processing /home/iap/test_bundle/clips/2021-06-03_17-02-28_001F55121D18.mp4
Opening in BLOCKING MODE
Opening in BLOCKING MODE 
NvMMLiteOpen : Block : BlockType = 261 
NVMEDIA: Reading vendor.tegra.display-size : status: 6 
NvMMLiteBlockCreate : Block : BlockType = 261 
[ WARN:0] global /tmp/opencv_install412/opencv-4.5.0/modules/videoio/src/cap_gstreamer.cpp (898) open OpenCV | GStreamer warning: unable to query duration of stream
[ WARN:0] global /tmp/opencv_install412/opencv-4.5.0/modules/videoio/src/cap_gstreamer.cpp (935) open OpenCV | GStreamer warning: Cannot query video position: status=1, value=1, duration=-1
2021-08-26 15:32:50 [    INFO] 1280x720 stream @ 10 FPS
2021-08-26 15:32:50 [    INFO] Loading detector model...
[TensorRT] ERROR: /home/jenkins/workspace/TensorRT/helpers/rel-7.1/L1_Nightly_Internal/build/source/rtSafe/resources.h (460) - Cuda Error in loadKernel: 3 (initialization error)
[TensorRT] ERROR: INVALID_STATE: std::exception
[TensorRT] ERROR: INVALID_CONFIG: Deserialize the cuda engine failed.
Exception ignored in: <bound method TRTInference.__del__ of <fastmot.utils.inference.TRTInference object at 0x7f3e4e3e80>>
Traceback (most recent call last):
  File "fastmot/utils/inference.py", line 101, in __del__
AttributeError: 'NoneType' object has no attribute '__del__'
terminate called without an active exception
Aborted (core dumped)
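The AttributeError in __del__ above is a secondary symptom: deserialization returns no engine, and the destructor then trips over the missing object. A minimal defensive pattern, with hypothetical names rather than FastMOT's actual code:

```python
class TRTInference:
    """Sketch of a TensorRT wrapper with a destructor that tolerates
    failed initialization. Names here are hypothetical."""

    def __init__(self, engine_data):
        # Set the attribute up front so __del__ is safe even if
        # deserialization fails partway through __init__
        self.engine = None
        # In the real code, runtime.deserialize_cuda_engine() returns None
        # on failure; this placeholder stands in for that call:
        self.engine = object() if engine_data else None

    def __del__(self):
        # Only release the engine if it was actually created
        if getattr(self, 'engine', None) is not None:
            self.engine = None
```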

Could you share more about your environment? Did you set up the system with JetPack 4.5.1? Any manual installation?

This is a base install with JetPack 4.5.1. In addition, this install script was run so the FastMOT library can be used. Let me know if there’s anything else specific you’d like to know about the environment.

Just to reiterate, there’s no issue with creating and running the TensorRT engine on the Jetson per se; rather, I can’t build a binary with PyInstaller that is able to deserialize the very same engine file. Here’s the trtexec output nonetheless:

/usr/src/tensorrt/bin/trtexec --loadEngine=models/yolov4_crowdhuman.trt --plugins=plugins/libyolo_layer.so
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=models/yolov4_crowdhuman.trt --plugins=plugins/libyolo_layer.so
[08/26/2021-15:09:47] [I] === Model Options ===
[08/26/2021-15:09:47] [I] Format: *
[08/26/2021-15:09:47] [I] Model: 
[08/26/2021-15:09:47] [I] Output:
[08/26/2021-15:09:47] [I] === Build Options ===
[08/26/2021-15:09:47] [I] Max batch: 1
[08/26/2021-15:09:47] [I] Workspace: 16 MB
[08/26/2021-15:09:47] [I] minTiming: 1
[08/26/2021-15:09:47] [I] avgTiming: 8
[08/26/2021-15:09:47] [I] Precision: FP32
[08/26/2021-15:09:47] [I] Calibration: 
[08/26/2021-15:09:47] [I] Safe mode: Disabled
[08/26/2021-15:09:47] [I] Save engine: 
[08/26/2021-15:09:47] [I] Load engine: models/yolov4_crowdhuman.trt
[08/26/2021-15:09:47] [I] Builder Cache: Enabled
[08/26/2021-15:09:47] [I] NVTX verbosity: 0
[08/26/2021-15:09:47] [I] Inputs format: fp32:CHW
[08/26/2021-15:09:47] [I] Outputs format: fp32:CHW
[08/26/2021-15:09:47] [I] Input build shapes: model
[08/26/2021-15:09:47] [I] Input calibration shapes: model
[08/26/2021-15:09:47] [I] === System Options ===
[08/26/2021-15:09:47] [I] Device: 0
[08/26/2021-15:09:47] [I] DLACore: 
[08/26/2021-15:09:47] [I] Plugins: plugins/libyolo_layer.so
[08/26/2021-15:09:47] [I] === Inference Options ===
[08/26/2021-15:09:47] [I] Batch: 1
[08/26/2021-15:09:47] [I] Input inference shapes: model
[08/26/2021-15:09:47] [I] Iterations: 10
[08/26/2021-15:09:47] [I] Duration: 3s (+ 200ms warm up)
[08/26/2021-15:09:47] [I] Sleep time: 0ms
[08/26/2021-15:09:47] [I] Streams: 1
[08/26/2021-15:09:47] [I] ExposeDMA: Disabled
[08/26/2021-15:09:47] [I] Spin-wait: Disabled
[08/26/2021-15:09:47] [I] Multithreading: Disabled
[08/26/2021-15:09:47] [I] CUDA Graph: Disabled
[08/26/2021-15:09:47] [I] Skip inference: Disabled
[08/26/2021-15:09:47] [I] Inputs:
[08/26/2021-15:09:47] [I] === Reporting Options ===
[08/26/2021-15:09:47] [I] Verbose: Disabled
[08/26/2021-15:09:47] [I] Averages: 10 inferences
[08/26/2021-15:09:47] [I] Percentile: 99
[08/26/2021-15:09:47] [I] Dump output: Disabled
[08/26/2021-15:09:47] [I] Profile: Disabled
[08/26/2021-15:09:47] [I] Export timing to JSON file: 
[08/26/2021-15:09:47] [I] Export output to JSON file: 
[08/26/2021-15:09:47] [I] Export profile to JSON file: 
[08/26/2021-15:09:47] [I] 
[08/26/2021-15:09:47] [I] Loading supplied plugin library: plugins/libyolo_layer.so
[08/26/2021-15:09:50] [I] Starting inference threads
[08/26/2021-15:09:54] [I] Warmup completed 6 queries over 200 ms
[08/26/2021-15:09:54] [I] Timing trace has 149 queries over 3.05287 s
[08/26/2021-15:09:54] [I] Trace averages of 10 runs:
[08/26/2021-15:09:54] [I] Average on 10 runs - GPU latency: 20.0567 ms - Host latency: 20.1655 ms (end to end 20.1776 ms, enqueue 5.75976 ms)
[08/26/2021-15:09:54] [I] Average on 10 runs - GPU latency: 20.3497 ms - Host latency: 20.4601 ms (end to end 20.4691 ms, enqueue 5.20462 ms)
[08/26/2021-15:09:54] [I] Average on 10 runs - GPU latency: 20.5195 ms - Host latency: 20.6306 ms (end to end 20.6396 ms, enqueue 4.42437 ms)
[08/26/2021-15:09:54] [I] Average on 10 runs - GPU latency: 20.584 ms - Host latency: 20.6967 ms (end to end 20.7062 ms, enqueue 5.00521 ms)
[08/26/2021-15:09:54] [I] Average on 10 runs - GPU latency: 20.6653 ms - Host latency: 20.7776 ms (end to end 20.7881 ms, enqueue 4.22771 ms)
[08/26/2021-15:09:54] [I] Average on 10 runs - GPU latency: 20.5224 ms - Host latency: 20.6329 ms (end to end 20.6421 ms, enqueue 3.78873 ms)
[08/26/2021-15:09:54] [I] Average on 10 runs - GPU latency: 20.3759 ms - Host latency: 20.4856 ms (end to end 20.4967 ms, enqueue 3.84323 ms)
[08/26/2021-15:09:54] [I] Average on 10 runs - GPU latency: 20.2148 ms - Host latency: 20.3242 ms (end to end 20.3331 ms, enqueue 3.32419 ms)
[08/26/2021-15:09:54] [I] Average on 10 runs - GPU latency: 20.2372 ms - Host latency: 20.3476 ms (end to end 20.3614 ms, enqueue 3.4265 ms)
[08/26/2021-15:09:54] [I] Average on 10 runs - GPU latency: 20.3995 ms - Host latency: 20.5098 ms (end to end 20.5219 ms, enqueue 3.253 ms)
[08/26/2021-15:09:54] [I] Average on 10 runs - GPU latency: 20.475 ms - Host latency: 20.5853 ms (end to end 20.5958 ms, enqueue 3.17559 ms)
[08/26/2021-15:09:54] [I] Average on 10 runs - GPU latency: 20.3979 ms - Host latency: 20.5093 ms (end to end 20.5164 ms, enqueue 3.07642 ms)
[08/26/2021-15:09:54] [I] Average on 10 runs - GPU latency: 20.2865 ms - Host latency: 20.3953 ms (end to end 20.404 ms, enqueue 3.17839 ms)
[08/26/2021-15:09:54] [I] Average on 10 runs - GPU latency: 20.294 ms - Host latency: 20.4048 ms (end to end 20.4142 ms, enqueue 2.98071 ms)
[08/26/2021-15:09:54] [I] Host Latency
[08/26/2021-15:09:54] [I] min: 20.0084 ms (end to end 20.0187 ms)
[08/26/2021-15:09:54] [I] max: 20.8663 ms (end to end 20.8785 ms)
[08/26/2021-15:09:54] [I] mean: 20.4788 ms (end to end 20.489 ms)
[08/26/2021-15:09:54] [I] median: 20.49 ms (end to end 20.4949 ms)
[08/26/2021-15:09:54] [I] percentile: 20.8638 ms at 99% (end to end 20.8702 ms at 99%)
[08/26/2021-15:09:54] [I] throughput: 48.8066 qps
[08/26/2021-15:09:54] [I] walltime: 3.05287 s
[08/26/2021-15:09:54] [I] Enqueue Time
[08/26/2021-15:09:54] [I] min: 2.57812 ms
[08/26/2021-15:09:54] [I] max: 5.99561 ms
[08/26/2021-15:09:54] [I] median: 3.43164 ms
[08/26/2021-15:09:54] [I] GPU Compute
[08/26/2021-15:09:54] [I] min: 19.9009 ms
[08/26/2021-15:09:54] [I] max: 20.7534 ms
[08/26/2021-15:09:54] [I] mean: 20.3684 ms
[08/26/2021-15:09:54] [I] median: 20.3796 ms
[08/26/2021-15:09:54] [I] percentile: 20.75 ms at 99%
[08/26/2021-15:09:54] [I] total compute time: 3.03489 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=models/yolov4_crowdhuman.trt --plugins=plugins/libyolo_layer.so

FYI, here’s the execution of FastMOT on the same system but using Python instead of the PyInstaller binary:

python app.py --mode extraction --input_uri ~/test_bundle/clips/2021-06-03_17-02-28_001F55121D18.mp4 
[TensorRT] VERBOSE: Registered plugin creator - ::GridAnchor_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::NMS_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Reorg_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Region_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Clip_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::LReLU_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::PriorBox_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Normalize_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::RPROI_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::BatchedNMS_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::FlattenConcat_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::CropAndResize version 1
[TensorRT] VERBOSE: Registered plugin creator - ::DetectionLayer_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Proposal version 1
[TensorRT] VERBOSE: Registered plugin creator - ::ProposalLayer_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::PyramidROIAlign_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::ResizeNearest_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Split version 1
[TensorRT] VERBOSE: Registered plugin creator - ::SpecialSlice_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::InstanceNormalization_TRT version 1
2021-08-26 15:43:48 [    INFO] Processing /home/iap/test_bundle/clips/2021-06-03_17-02-28_001F55121D18.mp4
Opening in BLOCKING MODE
Opening in BLOCKING MODE 
NvMMLiteOpen : Block : BlockType = 261 
NVMEDIA: Reading vendor.tegra.display-size : status: 6 
NvMMLiteBlockCreate : Block : BlockType = 261 
[ WARN:0] global /tmp/opencv_install412/opencv-4.5.0/modules/videoio/src/cap_gstreamer.cpp (898) open OpenCV | GStreamer warning: unable to query duration of stream
[ WARN:0] global /tmp/opencv_install412/opencv-4.5.0/modules/videoio/src/cap_gstreamer.cpp (935) open OpenCV | GStreamer warning: Cannot query video position: status=1, value=1, duration=-1
2021-08-26 15:43:49 [    INFO] 1280x720 stream @ 10 FPS
2021-08-26 15:43:49 [    INFO] Loading detector model...
[TensorRT] VERBOSE: Deserialize required 2279651 microseconds.
2021-08-26 15:43:53 [    INFO] Loading feature extractor model...
[TensorRT] VERBOSE: Deserialize required 26136 microseconds.
2021-08-26 15:44:40 [    INFO] Found:        person   1 at ( 677, 356)
2021-08-26 15:44:40 [    INFO] Found:        person   2 at ( 612, 345)
2021-08-26 15:44:40 [    INFO] Found:        person   3 at ( 963, 242)
2021-08-26 15:44:40 [    INFO] Found:        person   4 at ( 919, 232)
2021-08-26 15:44:40 [    INFO] Found:        person   5 at ( 804, 208)
2021-08-26 15:44:46 [    INFO] Found:        person   7 at ( 135, 464)
2021-08-26 15:44:48 [    INFO] Average FPS: 2

@AastaLLL in case it helps, I’m attaching the strace output from running the binary. The full strace is 107 MB and therefore too large to upload, so I’m instead uploading a grep for “cuda” with 3 lines of context.

strace_cuda_grep_context.log (1.3 MB)
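For transparency, the attached file was produced roughly as follows (a reconstruction; the exact strace flags are an assumption on my part):

```shell
# Capture every syscall of the bundled binary (the full log was ~107 MB)
strace -f -o strace_full.log ./dist/fastMOT --mode extraction \
    --input_uri ~/test_bundle/clips/2021-06-03_17-02-28_001F55121D18.mp4

# Keep only CUDA-related lines, with 3 lines of context on each side
grep -i -C 3 cuda strace_full.log > strace_cuda_grep_context.log
```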

The inability to deserialize the properly, externally built TRT engine, together with the inability to build the TRT engine inside the binary, suggests that the bundle somehow does not reference CUDA correctly. Yet the strace output, as far as I understand it, contradicts that assumption.

As demonstrated, there’s nothing wrong with the tooling on the Jetson itself: the same code runs as expected prior to bundling with PyInstaller.

Hi,

This seems related to the PyInstaller use case.

Could you try the workaround mentioned in the above link to see if it works?

Thanks.

@AastaLLL
I actually linked this very same StackOverflow post in my original post.

Two important things to note relating to this:

  1. The author of the StackOverflow post said that building the TRT engine file from scratch prior to compilation fixed their issue. In my case, neither building the TRT engine file ahead of time nor building it inside the binary works (though the two approaches yield different errors). You can re-read my original post for the full descriptions, but I’ll reiterate the errors below for convenience.

  2. The only apparent solution in that StackOverflow post is a comment suggesting that pycuda be excluded when building the bundle and then copied into it manually. This isn’t relevant, as I’m not using pycuda at all.

The two different errors mentioned above are as follows:

  • When trying to deserialize the engine built outside of the bundle:
[TensorRT] ERROR: /home/jenkins/workspace/TensorRT/helpers/rel-7.1/L1_Nightly_Internal/build/source/rtSafe/resources.h (460) - Cuda Error in loadKernel: 3 (initialization error)
[TensorRT] ERROR: INVALID_STATE: std::exception
[TensorRT] ERROR: INVALID_CONFIG: Deserialize the cuda engine failed.
  • When trying to build the TRT engine file inside the binary:
RuntimeError: Driver error: 

Hi,

We need to reproduce this internally for further suggestions.
Would you mind sharing detailed steps/sources to reproduce this issue in a clean environment?

Thanks.