I do not get any performance improvement after using TensorRT provider for object detection model

KamalLAGH · June 9, 2022, 8:27pm

Describe the bug
I downloaded ssd_mobilenet_v2_320x320_coco17_tpu-8 from the TensorFlow 2 model zoo and converted it to ONNX with tf2onnx from the onnx GitHub. I then ran shape inference on the model with the symbolic_shape_infer.py script from the onnxruntime GitHub.
I compared two runs: the original ONNX model with CUDAExecutionProvider, and the shape-inferred model with both CUDAExecutionProvider and TensorrtExecutionProvider. I do not see any improvement in performance (FPS) between the two models. The output reports Unsupported ONNX data type: UINT8 (2) and many errors about executing MultiClassNonMaxSuppression with TensorrtExecutionProvider (see the screenshot below).
The model still runs correctly, but at only 1.15 FPS.
Is there another way to optimize the model so that it runs faster on the Jetson Nano?
Any help is much appreciated.
Thanks in advance.

System information
Jetson Nano 4GB
Jetpack: 4.6
Linux Ubuntu 18.04.6 LTS
ONNXRuntime-GPU installed from source
ONNX Runtime version: 1.10.0
Python version: 3.6
CUDA/cuDNN version: 10.2.3 / 8.2.1.32
GPU model and memory: Tegra
TensorRT 8.0.1.6

To Reproduce
Convert the downloaded SavedModel to ONNX with the command below:
python3 -m tf2onnx.convert --saved-model ssd_mobilenet_v2_320x320_coco17_tpu-8/saved_model --output ssdmobilenetv2_320.onnx
Download the onnxruntime wheel:
wget https://nvidia.box.com/shared/static/jy7nqva7l88mq9i8bw3g3sklzf4kccn2.whl -O onnxruntime_gpu-1.11.0-cp36-cp36m-linux_aarch64.whl
Install the wheel with pip:
sudo pip3 install onnxruntime_gpu-1.11.0-cp36-cp36m-linux_aarch64.whl
Install SymPy:
pip3 install sympy
Infer shapes in the model by running the symbolic_shape_infer.py script:
python3 symbolic_shape_infer.py --input ssdmobilenetv2_320.onnx --output ssdmobilenetv2_320new.onnx --auto_merge

Expected behavior
Expect an improvement in performance (FPS) when running the model with TensorrtExecutionProvider
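
To make the FPS comparison between providers repeatable, a small timing harness can help. The following is only a minimal sketch, not the code used in this report: `measure_fps` and its parameters are hypothetical names, and the workload in the demo is a stand-in so that the snippet runs without a model. Warm-up calls are excluded from the timing because the first runs with TensorrtExecutionProvider include engine building.

```python
import time
from typing import Callable


def measure_fps(run_once: Callable[[], object], warmup: int = 5, iters: int = 50) -> float:
    """Return the average number of run_once() calls per second.

    Warm-up calls are executed first and are not timed.
    (Hypothetical helper for illustration; not part of ONNX Runtime.)
    """
    if iters <= 0:
        raise ValueError("iters must be positive")
    for _ in range(warmup):
        run_once()  # untimed: absorbs one-off initialisation cost
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    # Guard against a zero reading for extremely fast callables.
    return iters / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a model or a GPU.
    fps = measure_fps(lambda: sum(range(10_000)), warmup=2, iters=20)
    print(f"{fps:.1f} calls/s")
```

With ONNX Runtime, `run_once` would typically wrap a call such as `session.run(None, {input_name: image})`, where `session` is an `onnxruntime.InferenceSession` created with the desired `providers` list. Using the same input and the same `iters` for each provider keeps the numbers comparable.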

Screenshot

AastaLLL · June 13, 2022, 4:15am

Hi,

First, you can maximize the Nano performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Since you already have the ONNX model, would you mind also testing it with the trtexec binary?

$ /usr/src/tensorrt/bin/trtexec --onnx=[your/model]
$ /usr/src/tensorrt/bin/trtexec --onnx=[your/model] --fp16
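
If it is more convenient to run both variants from a script, the two commands above could be wrapped roughly as follows. This is a sketch under assumptions: `trtexec_cmd` and `run_trtexec` are hypothetical helper names, `model.onnx` is a placeholder path, and only the `--onnx` and `--fp16` options shown above are used.

```python
import os
import subprocess
from typing import List, Optional

TRTEXEC = "/usr/src/tensorrt/bin/trtexec"  # default location on JetPack, as used above


def trtexec_cmd(onnx_path: str, fp16: bool = False) -> List[str]:
    """Build the trtexec argument list for an ONNX model (hypothetical helper)."""
    cmd = [TRTEXEC, f"--onnx={onnx_path}"]
    if fp16:
        cmd.append("--fp16")
    return cmd


def run_trtexec(onnx_path: str, fp16: bool = False) -> Optional[int]:
    """Run trtexec and return its exit code, or None if the binary is missing."""
    cmd = trtexec_cmd(onnx_path, fp16)
    if not os.path.exists(TRTEXEC):
        print("trtexec not found; would run:", " ".join(cmd))
        return None
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    # "model.onnx" is a placeholder; substitute the path to your own model.
    for use_fp16 in (False, True):
        run_trtexec("model.onnx", fp16=use_fp16)
```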

Thanks.

KamalLAGH · June 13, 2022, 6:33am

Hi AastaLLL,
Thanks for your help.
I am traveling; I will try it tomorrow and post the results.
Thanks again.

KamalLAGH · June 21, 2022, 9:57am

Hi AastaLLL,
Unfortunately, running
/usr/src/tensorrt/bin/trtexec --onnx=[your/model] --fp16
returns the following error:

/usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/ssdmobilenetv2_320new.onnx --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/ssdmobilenetv2_320new.onnx --fp16
[06/21/2022-10:32:30] [I] === Model Options ===
[06/21/2022-10:32:30] [I] Format: ONNX
[06/21/2022-10:32:30] [I] Model: /home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/ssdmobilenetv2_320new.onnx
[06/21/2022-10:32:30] [I] Output:
[06/21/2022-10:32:30] [I] === Build Options ===
[06/21/2022-10:32:30] [I] Max batch: explicit
[06/21/2022-10:32:30] [I] Workspace: 16 MiB
[06/21/2022-10:32:30] [I] minTiming: 1
[06/21/2022-10:32:30] [I] avgTiming: 8
[06/21/2022-10:32:30] [I] Precision: FP32+FP16
[06/21/2022-10:32:30] [I] Calibration: 
[06/21/2022-10:32:30] [I] Refit: Disabled
[06/21/2022-10:32:30] [I] Sparsity: Disabled
[06/21/2022-10:32:30] [I] Safe mode: Disabled
[06/21/2022-10:32:30] [I] Restricted mode: Disabled
[06/21/2022-10:32:30] [I] Save engine: 
[06/21/2022-10:32:30] [I] Load engine: 
[06/21/2022-10:32:30] [I] NVTX verbosity: 0
[06/21/2022-10:32:30] [I] Tactic sources: Using default tactic sources
[06/21/2022-10:32:30] [I] timingCacheMode: local
[06/21/2022-10:32:30] [I] timingCacheFile: 
[06/21/2022-10:32:30] [I] Input(s)s format: fp32:CHW
[06/21/2022-10:32:30] [I] Output(s)s format: fp32:CHW
[06/21/2022-10:32:30] [I] Input build shapes: model
[06/21/2022-10:32:30] [I] Input calibration shapes: model
[06/21/2022-10:32:30] [I] === System Options ===
[06/21/2022-10:32:30] [I] Device: 0
[06/21/2022-10:32:30] [I] DLACore: 
[06/21/2022-10:32:30] [I] Plugins:
[06/21/2022-10:32:30] [I] === Inference Options ===
[06/21/2022-10:32:30] [I] Batch: Explicit
[06/21/2022-10:32:30] [I] Input inference shapes: model
[06/21/2022-10:32:30] [I] Iterations: 10
[06/21/2022-10:32:30] [I] Duration: 3s (+ 200ms warm up)
[06/21/2022-10:32:30] [I] Sleep time: 0ms
[06/21/2022-10:32:30] [I] Streams: 1
[06/21/2022-10:32:30] [I] ExposeDMA: Disabled
[06/21/2022-10:32:30] [I] Data transfers: Enabled
[06/21/2022-10:32:30] [I] Spin-wait: Disabled
[06/21/2022-10:32:30] [I] Multithreading: Disabled
[06/21/2022-10:32:30] [I] CUDA Graph: Disabled
[06/21/2022-10:32:30] [I] Separate profiling: Disabled
[06/21/2022-10:32:30] [I] Time Deserialize: Disabled
[06/21/2022-10:32:30] [I] Time Refit: Disabled
[06/21/2022-10:32:30] [I] Skip inference: Disabled
[06/21/2022-10:32:30] [I] Inputs:
[06/21/2022-10:32:30] [I] === Reporting Options ===
[06/21/2022-10:32:30] [I] Verbose: Disabled
[06/21/2022-10:32:30] [I] Averages: 10 inferences
[06/21/2022-10:32:30] [I] Percentile: 99
[06/21/2022-10:32:30] [I] Dump refittable layers:Disabled
[06/21/2022-10:32:30] [I] Dump output: Disabled
[06/21/2022-10:32:30] [I] Profile: Disabled
[06/21/2022-10:32:30] [I] Export timing to JSON file: 
[06/21/2022-10:32:30] [I] Export output to JSON file: 
[06/21/2022-10:32:30] [I] Export profile to JSON file: 
[06/21/2022-10:32:30] [I] 
[06/21/2022-10:32:30] [I] === Device Information ===
[06/21/2022-10:32:30] [I] Selected Device: NVIDIA Tegra X1
[06/21/2022-10:32:30] [I] Compute Capability: 5.3
[06/21/2022-10:32:30] [I] SMs: 1
[06/21/2022-10:32:30] [I] Compute Clock Rate: 0.9216 GHz
[06/21/2022-10:32:30] [I] Device Global Memory: 3964 MiB
[06/21/2022-10:32:30] [I] Shared Memory per SM: 64 KiB
[06/21/2022-10:32:30] [I] Memory Bus Width: 64 bits (ECC disabled)
[06/21/2022-10:32:30] [I] Memory Clock Rate: 0.01275 GHz
[06/21/2022-10:32:30] [I] 
[06/21/2022-10:32:30] [I] TensorRT version: 8001
[06/21/2022-10:32:33] [I] [TRT] [MemUsageChange] Init CUDA: CPU +203, GPU +0, now: CPU 221, GPU 3465 (MiB)
[06/21/2022-10:32:33] [I] Start parsing network model
[06/21/2022-10:32:34] [I] [TRT] ----------------------------------------------------------------
[06/21/2022-10:32:34] [I] [TRT] Input filename:   /home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/ssdmobilenetv2_320new.onnx
[06/21/2022-10:32:34] [I] [TRT] ONNX IR version:  0.0.7
[06/21/2022-10:32:34] [I] [TRT] Opset version:    13
[06/21/2022-10:32:34] [I] [TRT] Producer name:    tf2onnx
[06/21/2022-10:32:34] [I] [TRT] Producer version: 1.10.0 e9b6cb
[06/21/2022-10:32:34] [I] [TRT] Domain:           
[06/21/2022-10:32:34] [I] [TRT] Model version:    0
[06/21/2022-10:32:34] [I] [TRT] Doc string:       
[06/21/2022-10:32:34] [I] [TRT] ----------------------------------------------------------------
Unsupported ONNX data type: UINT8 (2)
[06/21/2022-10:32:34] [E] [TRT] ModelImporter.cpp:726: ERROR: input_tensor:230 In function importInput:
[8] Assertion failed: convertDtype(onnxDtype.elem_type(), &trtDtype) && "Failed to convert ONNX date type to TensorRT data type."
[06/21/2022-10:32:34] [E] Failed to parse onnx file
[06/21/2022-10:32:34] [I] Finish parsing network model
[06/21/2022-10:32:34] [E] Parsing model failed
[06/21/2022-10:32:34] [E] Engine creation failed
[06/21/2022-10:32:34] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/ssdmobilenetv2_320new.onnx --fp16

When I run the command without the --fp16 option, I get the same error, Unsupported ONNX data type: UINT8 (2). Is there a solution to this problem?
Any help is much appreciated.
Kamal

KamalLAGH · June 21, 2022, 11:35am

Hi AastaLLL,
I tried to change the input dtype from uint8 to float32 using the script below:

import numpy as np
import onnx
import onnx_graphsurgeon as gs

print("Patching the ONNX model...")

# Load the shape-inferred model and declare every graph input as float32
# instead of uint8, which the ONNX parser of this TensorRT version rejected.
graph = gs.import_onnx(onnx.load("ssdmobilenetv2_320new.onnx"))
for inp in graph.inputs:
    inp.dtype = np.float32

onnx.save(gs.export_onnx(graph), "updated_model.onnx")

print("Checking the patched model with onnx.checker...")
model = onnx.load("updated_model.onnx")
onnx.checker.check_model(model)
print("The model is checked!")

Using the newly generated model, I ran trtexec again: /usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx --fp16
The UINT8 error disappears, but a different error is thrown: Assertion failed: (inputs.at(1).is_weights()) && "This version of TensorRT only supports input K as an initializer."
See below:

/usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx --fp16
[06/21/2022-11:19:24] [I] === Model Options ===
[06/21/2022-11:19:24] [I] Format: ONNX
[06/21/2022-11:19:24] [I] Model: /home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx
[06/21/2022-11:19:24] [I] Output:
[06/21/2022-11:19:24] [I] === Build Options ===
[06/21/2022-11:19:24] [I] Max batch: explicit
[06/21/2022-11:19:24] [I] Workspace: 16 MiB
[06/21/2022-11:19:24] [I] minTiming: 1
[06/21/2022-11:19:24] [I] avgTiming: 8
[06/21/2022-11:19:24] [I] Precision: FP32+FP16
[06/21/2022-11:19:24] [I] Calibration: 
[06/21/2022-11:19:24] [I] Refit: Disabled
[06/21/2022-11:19:24] [I] Sparsity: Disabled
[06/21/2022-11:19:24] [I] Safe mode: Disabled
[06/21/2022-11:19:24] [I] Restricted mode: Disabled
[06/21/2022-11:19:24] [I] Save engine: 
[06/21/2022-11:19:24] [I] Load engine: 
[06/21/2022-11:19:24] [I] NVTX verbosity: 0
[06/21/2022-11:19:24] [I] Tactic sources: Using default tactic sources
[06/21/2022-11:19:24] [I] timingCacheMode: local
[06/21/2022-11:19:24] [I] timingCacheFile: 
[06/21/2022-11:19:24] [I] Input(s)s format: fp32:CHW
[06/21/2022-11:19:24] [I] Output(s)s format: fp32:CHW
[06/21/2022-11:19:24] [I] Input build shapes: model
[06/21/2022-11:19:24] [I] Input calibration shapes: model
[06/21/2022-11:19:24] [I] === System Options ===
[06/21/2022-11:19:24] [I] Device: 0
[06/21/2022-11:19:24] [I] DLACore: 
[06/21/2022-11:19:24] [I] Plugins:
[06/21/2022-11:19:24] [I] === Inference Options ===
[06/21/2022-11:19:24] [I] Batch: Explicit
[06/21/2022-11:19:24] [I] Input inference shapes: model
[06/21/2022-11:19:24] [I] Iterations: 10
[06/21/2022-11:19:24] [I] Duration: 3s (+ 200ms warm up)
[06/21/2022-11:19:24] [I] Sleep time: 0ms
[06/21/2022-11:19:24] [I] Streams: 1
[06/21/2022-11:19:24] [I] ExposeDMA: Disabled
[06/21/2022-11:19:24] [I] Data transfers: Enabled
[06/21/2022-11:19:24] [I] Spin-wait: Disabled
[06/21/2022-11:19:24] [I] Multithreading: Disabled
[06/21/2022-11:19:24] [I] CUDA Graph: Disabled
[06/21/2022-11:19:24] [I] Separate profiling: Disabled
[06/21/2022-11:19:24] [I] Time Deserialize: Disabled
[06/21/2022-11:19:24] [I] Time Refit: Disabled
[06/21/2022-11:19:24] [I] Skip inference: Disabled
[06/21/2022-11:19:24] [I] Inputs:
[06/21/2022-11:19:24] [I] === Reporting Options ===
[06/21/2022-11:19:24] [I] Verbose: Disabled
[06/21/2022-11:19:24] [I] Averages: 10 inferences
[06/21/2022-11:19:24] [I] Percentile: 99
[06/21/2022-11:19:24] [I] Dump refittable layers:Disabled
[06/21/2022-11:19:24] [I] Dump output: Disabled
[06/21/2022-11:19:24] [I] Profile: Disabled
[06/21/2022-11:19:24] [I] Export timing to JSON file: 
[06/21/2022-11:19:24] [I] Export output to JSON file: 
[06/21/2022-11:19:24] [I] Export profile to JSON file: 
[06/21/2022-11:19:24] [I] 
[06/21/2022-11:19:24] [I] === Device Information ===
[06/21/2022-11:19:24] [I] Selected Device: NVIDIA Tegra X1
[06/21/2022-11:19:24] [I] Compute Capability: 5.3
[06/21/2022-11:19:24] [I] SMs: 1
[06/21/2022-11:19:24] [I] Compute Clock Rate: 0.9216 GHz
[06/21/2022-11:19:24] [I] Device Global Memory: 3964 MiB
[06/21/2022-11:19:24] [I] Shared Memory per SM: 64 KiB
[06/21/2022-11:19:24] [I] Memory Bus Width: 64 bits (ECC disabled)
[06/21/2022-11:19:24] [I] Memory Clock Rate: 0.01275 GHz
[06/21/2022-11:19:24] [I] 
[06/21/2022-11:19:24] [I] TensorRT version: 8001
[06/21/2022-11:19:25] [I] [TRT] [MemUsageChange] Init CUDA: CPU +203, GPU +0, now: CPU 221, GPU 3787 (MiB)
[06/21/2022-11:19:25] [I] Start parsing network model
[06/21/2022-11:19:25] [I] [TRT] ----------------------------------------------------------------
[06/21/2022-11:19:25] [I] [TRT] Input filename:   /home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx
[06/21/2022-11:19:25] [I] [TRT] ONNX IR version:  0.0.8
[06/21/2022-11:19:25] [I] [TRT] Opset version:    13
[06/21/2022-11:19:25] [I] [TRT] Producer name:    tf2onnx
[06/21/2022-11:19:25] [I] [TRT] Producer version: 1.10.0 e9b6cb
[06/21/2022-11:19:25] [I] [TRT] Domain:           
[06/21/2022-11:19:25] [I] [TRT] Model version:    0
[06/21/2022-11:19:25] [I] [TRT] Doc string:       
[06/21/2022-11:19:25] [I] [TRT] ----------------------------------------------------------------
[06/21/2022-11:19:25] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[06/21/2022-11:19:25] [W] [TRT] onnx2trt_utils.cpp:390: One or more weights outside the range of INT32 was clamped
[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:720: While parsing node number 2922 [TopK -> "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/SortByField/TopKV2:0"]:
[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:721: --- Begin node ---
[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:722: input: "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/Concatenate/concat_1:0"
input: "Unsqueeze__4282:0"
output: "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/SortByField/TopKV2:0"
output: "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/SortByField/TopKV2:1"
name: "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/SortByField/TopKV2"
op_type: "TopK"
attribute {
  name: "sorted"
  i: 1
  type: INT
}

[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:723: --- End node ---
[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:726: ERROR: builtin_op_importers.cpp:4292 In function importTopK:
[8] Assertion failed: (inputs.at(1).is_weights()) && "This version of TensorRT only supports input K as an initializer."
[06/21/2022-11:20:11] [E] Failed to parse onnx file
[06/21/2022-11:20:11] [I] Finish parsing network model
[06/21/2022-11:20:11] [E] Parsing model failed
[06/21/2022-11:20:11] [E] Engine creation failed
[06/21/2022-11:20:11] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx --fp16
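
Since the useful lines are buried in a long log, a small filter can pull out the errors and the node that the parser stopped on. This is a rough sketch under assumptions: `summarize_trtexec_log` is a hypothetical helper, and the patterns are based only on the log format shown in this thread, so other TensorRT versions may print differently.

```python
import re
from typing import List, Optional, Tuple

# Matches e.g. 'While parsing node number 2922 [TopK -> "..."]' as printed above.
FAILING_NODE = re.compile(r"While parsing node number (\d+) \[(\w+) ->")


def summarize_trtexec_log(text: str) -> Tuple[List[str], Optional[Tuple[int, str]]]:
    """Return (error lines, (node index, op type) or None) from a trtexec log.

    Hypothetical helper; assumes the "[E]" / "Assertion failed" markers
    seen in the TensorRT 8.0.1 output quoted in this thread.
    """
    errors = [
        line.strip()
        for line in text.splitlines()
        if "[E]" in line or "Assertion failed" in line
    ]
    match = FAILING_NODE.search(text)
    node = (int(match.group(1)), match.group(2)) if match else None
    return errors, node


if __name__ == "__main__":
    # Two lines copied from the log above, with the long tensor name shortened.
    sample = (
        '[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:720: '
        'While parsing node number 2922 [TopK -> "TopKV2:0"]:\n'
        '[06/21/2022-11:20:11] [E] Failed to parse onnx file\n'
    )
    errors, node = summarize_trtexec_log(sample)
    print(len(errors), "error lines; failing node:", node)
```

For the log above, this points at the TopK node inside the MultiClassNonMaxSuppression post-processing, which is where the "input K as an initializer" assertion is raised.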

Any help is much appreciated.
Thanks in advance.
Kamal

AastaLLL · June 30, 2022, 5:25am

Hi,

The error is related to some unsupported layers.

Is it possible to upgrade your device to JetPack 4.6.2 with TensorRT 8.2?
If yes, could you try the sample below, which deploys the ssd_mobilenet_v2_320x320_coco17_tpu-8 model with TensorRT?

Thanks.

KamalLAGH · July 12, 2022, 5:18pm

Hi AastaLLL,
Thank you for your reply.
That is correct: to build the TensorRT engine, you have to upgrade JetPack to 4.6.1 or 4.6.2, since they come with TensorRT 8.2.1. However, re-exporting the TensorFlow model or creating the ONNX model cannot be done on the Jetson, because TensorFlow 2.5 cannot be installed on JetPack 4.6.1 and 4.6.2. Will NVIDIA provide a TensorFlow 2.5 build for Python 3.6 on Linux Arm for these JetPack versions?
https://developer.download.nvidia.com/compute/redist/jp/v461/tensorflow/
Thanks in advance

system · July 26, 2022, 5:19pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
