I do not get any performance improvement after using TensorRT provider for object detection model

Describe the bug
I downloaded ssd_mobilenet_v2_320x320_coco17_tpu-8 from the TensorFlow 2 model zoo and converted it to ONNX using tf2onnx from the ONNX GitHub. I then inferred shapes in the model using the symbolic_shape_infer.py script from the onnxruntime GitHub.
I ran two models for comparison: the original ONNX model with CUDAExecutionProvider, and the shape-inferred model with both CUDAExecutionProvider and TensorrtExecutionProvider. I do not see any improvement in performance (FPS) between the two. The output reports Unsupported ONNX data type: UINT8 (2) and many errors about the execution of multiClassNonMaxSuppression with TensorrtExecutionProvider; see the screenshot below.
The model still runs correctly, but at only 1.15 FPS.
Is there another way to optimize the model so that it runs faster on the Jetson Nano?
Any help is much appreciated.
Thanks in advance
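
For readers who want to reproduce the comparison, the two provider configurations described above could be set up roughly as in the sketch below. The helper name build_providers is hypothetical, and the TensorRT provider options shown (trt_fp16_enable, trt_engine_cache_enable) are assumed to be supported by the installed ONNX Runtime build. Session creation is left as a comment because it needs the model file and a GPU build.

```python
def build_providers(use_tensorrt, fp16=False):
    """Return an ONNX Runtime provider list, highest priority first.

    Hypothetical helper: the option names below are assumed to be
    supported by the installed onnxruntime-gpu build.
    """
    providers = []
    if use_tensorrt:
        trt_options = {
            "trt_fp16_enable": fp16,          # allow FP16 kernels
            "trt_engine_cache_enable": True,  # reuse built engines across runs
        }
        providers.append(("TensorrtExecutionProvider", trt_options))
    # Nodes that TensorRT does not take are assigned to the next provider.
    providers.append("CUDAExecutionProvider")
    return providers


print(build_providers(False))
print(build_providers(True, fp16=True))

# Usage (not executed here):
#   import onnxruntime as ort
#   sess = ort.InferenceSession("ssdmobilenetv2_320new.onnx",
#                               providers=build_providers(True, fp16=True))
#   print(sess.get_providers())  # providers actually registered for the session
```

Printing sess.get_providers() is a quick way to confirm whether the TensorRT provider was registered at all before comparing FPS numbers.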

System information
Jetson Nano 4GB
Jetpack: 4.6
Linux Ubuntu 18.04.06 LTS
ONNXRuntime-GPU installed from source
ONNX Runtime version: 1.10.0
Python version: 3.6
CUDA/cuDNN version: 10.2.3 / 8.2.1.32
GPU model and memory: Tegra
TensorRT 8.0.1.6

To Reproduce
Convert the downloaded model from SavedModel to ONNX using the command below:
python3 -m tf2onnx.convert --saved-model ssd_mobilenet_v2_320x320_coco17_tpu-8/saved_model --output ssdmobilenetv2_320.onnx
Download the onnxruntime wheel:
wget https://nvidia.box.com/shared/static/jy7nqva7l88mq9i8bw3g3sklzf4kccn2.whl -O onnxruntime_gpu-1.11.0-cp36-cp36m-linux_aarch64.whl
Install the pip wheel:
sudo pip3 install onnxruntime_gpu-1.11.0-cp36-cp36m-linux_aarch64.whl
Install SymPy:
pip3 install sympy
Infer shapes in the model by running the command below:
python3 symbolic_shape_infer.py --input ssdmobilenetv2_320.onnx --output ssdmobilenetv2_320new.onnx --auto_merge

Expected behavior
Expect an improvement in performance (FPS) when running the model with TensorrtExecutionProvider
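
To make the FPS comparison repeatable, a small timing helper along the following lines could be used. This is only a sketch: measure_fps is a hypothetical name, and run_once stands for whatever callable performs one inference (for example a lambda around a session's run call). Warm-up iterations are excluded from the measurement, since the first calls may include one-time setup costs such as engine building.

```python
import time


def measure_fps(run_once, warmup=5, iters=50):
    """Call run_once repeatedly and return the average frames per second."""
    for _ in range(warmup):
        run_once()  # warm-up calls are not timed
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    return iters / elapsed


# Stand-in workload so the sketch runs without a model: sleep about 1 ms per "frame".
fps = measure_fps(lambda: time.sleep(0.001), warmup=2, iters=20)
print(f"{fps:.1f} FPS")

# Usage with ONNX Runtime (not executed here; `sess` and `feed` are assumed to exist):
#   fps = measure_fps(lambda: sess.run(None, feed))
```

Running the same helper against both sessions, with the same input, gives numbers that can be compared directly.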

Screenshot

Hi,

First, you can maximize the Nano performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Since you already have the ONNX model, would you mind also testing it with trtexec binary?

$ /usr/src/tensorrt/bin/trtexec --onnx=[your/model]
$ /usr/src/tensorrt/bin/trtexec --onnx=[your/model] --fp16
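
If both runs need to be scripted, the two commands above could be assembled in Python as in the sketch below. The function name trtexec_command and the placeholder path model.onnx are illustrative only; the binary path and flags are the ones shown in the commands above.

```python
TRTEXEC = "/usr/src/tensorrt/bin/trtexec"


def trtexec_command(model_path, fp16=False):
    """Build the trtexec argument list for an ONNX model."""
    cmd = [TRTEXEC, f"--onnx={model_path}"]
    if fp16:
        cmd.append("--fp16")  # allow FP16 precision in addition to FP32
    return cmd


# "model.onnx" is a placeholder path.
for use_fp16 in (False, True):
    print(" ".join(trtexec_command("model.onnx", fp16=use_fp16)))

# To actually run a command on the device (not executed here):
#   import subprocess
#   subprocess.run(trtexec_command("model.onnx", fp16=True), check=False)
```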

Thanks.

Hi Aastalll
Thanks for your help.
I am traveling; I will try it tomorrow and post the results.
Thanks again

Hi Aastalll,
Unfortunately, running the command
/usr/src/tensorrt/bin/trtexec --onnx=[your/model] --fp16
returns the following error:

/usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/ssdmobilenetv2_320new.onnx --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/ssdmobilenetv2_320new.onnx --fp16
[06/21/2022-10:32:30] [I] === Model Options ===
[06/21/2022-10:32:30] [I] Format: ONNX
[06/21/2022-10:32:30] [I] Model: /home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/ssdmobilenetv2_320new.onnx
[06/21/2022-10:32:30] [I] Output:
[06/21/2022-10:32:30] [I] === Build Options ===
[06/21/2022-10:32:30] [I] Max batch: explicit
[06/21/2022-10:32:30] [I] Workspace: 16 MiB
[06/21/2022-10:32:30] [I] minTiming: 1
[06/21/2022-10:32:30] [I] avgTiming: 8
[06/21/2022-10:32:30] [I] Precision: FP32+FP16
[06/21/2022-10:32:30] [I] Calibration: 
[06/21/2022-10:32:30] [I] Refit: Disabled
[06/21/2022-10:32:30] [I] Sparsity: Disabled
[06/21/2022-10:32:30] [I] Safe mode: Disabled
[06/21/2022-10:32:30] [I] Restricted mode: Disabled
[06/21/2022-10:32:30] [I] Save engine: 
[06/21/2022-10:32:30] [I] Load engine: 
[06/21/2022-10:32:30] [I] NVTX verbosity: 0
[06/21/2022-10:32:30] [I] Tactic sources: Using default tactic sources
[06/21/2022-10:32:30] [I] timingCacheMode: local
[06/21/2022-10:32:30] [I] timingCacheFile: 
[06/21/2022-10:32:30] [I] Input(s)s format: fp32:CHW
[06/21/2022-10:32:30] [I] Output(s)s format: fp32:CHW
[06/21/2022-10:32:30] [I] Input build shapes: model
[06/21/2022-10:32:30] [I] Input calibration shapes: model
[06/21/2022-10:32:30] [I] === System Options ===
[06/21/2022-10:32:30] [I] Device: 0
[06/21/2022-10:32:30] [I] DLACore: 
[06/21/2022-10:32:30] [I] Plugins:
[06/21/2022-10:32:30] [I] === Inference Options ===
[06/21/2022-10:32:30] [I] Batch: Explicit
[06/21/2022-10:32:30] [I] Input inference shapes: model
[06/21/2022-10:32:30] [I] Iterations: 10
[06/21/2022-10:32:30] [I] Duration: 3s (+ 200ms warm up)
[06/21/2022-10:32:30] [I] Sleep time: 0ms
[06/21/2022-10:32:30] [I] Streams: 1
[06/21/2022-10:32:30] [I] ExposeDMA: Disabled
[06/21/2022-10:32:30] [I] Data transfers: Enabled
[06/21/2022-10:32:30] [I] Spin-wait: Disabled
[06/21/2022-10:32:30] [I] Multithreading: Disabled
[06/21/2022-10:32:30] [I] CUDA Graph: Disabled
[06/21/2022-10:32:30] [I] Separate profiling: Disabled
[06/21/2022-10:32:30] [I] Time Deserialize: Disabled
[06/21/2022-10:32:30] [I] Time Refit: Disabled
[06/21/2022-10:32:30] [I] Skip inference: Disabled
[06/21/2022-10:32:30] [I] Inputs:
[06/21/2022-10:32:30] [I] === Reporting Options ===
[06/21/2022-10:32:30] [I] Verbose: Disabled
[06/21/2022-10:32:30] [I] Averages: 10 inferences
[06/21/2022-10:32:30] [I] Percentile: 99
[06/21/2022-10:32:30] [I] Dump refittable layers:Disabled
[06/21/2022-10:32:30] [I] Dump output: Disabled
[06/21/2022-10:32:30] [I] Profile: Disabled
[06/21/2022-10:32:30] [I] Export timing to JSON file: 
[06/21/2022-10:32:30] [I] Export output to JSON file: 
[06/21/2022-10:32:30] [I] Export profile to JSON file: 
[06/21/2022-10:32:30] [I] 
[06/21/2022-10:32:30] [I] === Device Information ===
[06/21/2022-10:32:30] [I] Selected Device: NVIDIA Tegra X1
[06/21/2022-10:32:30] [I] Compute Capability: 5.3
[06/21/2022-10:32:30] [I] SMs: 1
[06/21/2022-10:32:30] [I] Compute Clock Rate: 0.9216 GHz
[06/21/2022-10:32:30] [I] Device Global Memory: 3964 MiB
[06/21/2022-10:32:30] [I] Shared Memory per SM: 64 KiB
[06/21/2022-10:32:30] [I] Memory Bus Width: 64 bits (ECC disabled)
[06/21/2022-10:32:30] [I] Memory Clock Rate: 0.01275 GHz
[06/21/2022-10:32:30] [I] 
[06/21/2022-10:32:30] [I] TensorRT version: 8001
[06/21/2022-10:32:33] [I] [TRT] [MemUsageChange] Init CUDA: CPU +203, GPU +0, now: CPU 221, GPU 3465 (MiB)
[06/21/2022-10:32:33] [I] Start parsing network model
[06/21/2022-10:32:34] [I] [TRT] ----------------------------------------------------------------
[06/21/2022-10:32:34] [I] [TRT] Input filename:   /home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/ssdmobilenetv2_320new.onnx
[06/21/2022-10:32:34] [I] [TRT] ONNX IR version:  0.0.7
[06/21/2022-10:32:34] [I] [TRT] Opset version:    13
[06/21/2022-10:32:34] [I] [TRT] Producer name:    tf2onnx
[06/21/2022-10:32:34] [I] [TRT] Producer version: 1.10.0 e9b6cb
[06/21/2022-10:32:34] [I] [TRT] Domain:           
[06/21/2022-10:32:34] [I] [TRT] Model version:    0
[06/21/2022-10:32:34] [I] [TRT] Doc string:       
[06/21/2022-10:32:34] [I] [TRT] ----------------------------------------------------------------
Unsupported ONNX data type: UINT8 (2)
[06/21/2022-10:32:34] [E] [TRT] ModelImporter.cpp:726: ERROR: input_tensor:230 In function importInput:
[8] Assertion failed: convertDtype(onnxDtype.elem_type(), &trtDtype) && "Failed to convert ONNX date type to TensorRT data type."
[06/21/2022-10:32:34] [E] Failed to parse onnx file
[06/21/2022-10:32:34] [I] Finish parsing network model
[06/21/2022-10:32:34] [E] Parsing model failed
[06/21/2022-10:32:34] [E] Engine creation failed
[06/21/2022-10:32:34] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/ssdmobilenetv2_320new.onnx --fp16
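
Since most of the log above is informational, a short filter can help isolate the failure. The sketch below assumes the log format shown above (a bracketed timestamp followed by a severity tag) and keeps only the [E] lines; extract_errors is a hypothetical helper name.

```python
import re

# Matches lines such as "[06/21/2022-10:32:34] [E] [TRT] message"
ERROR_LINE = re.compile(r"^\[[^\]]+\] \[E\] (?:\[TRT\] )?(.*)$")


def extract_errors(log_text):
    """Return the message part of every [E] line in a trtexec log."""
    errors = []
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            errors.append(match.group(1))
    return errors


sample = """[06/21/2022-10:32:34] [I] Start parsing network model
[06/21/2022-10:32:34] [E] Failed to parse onnx file
[06/21/2022-10:32:34] [E] Parsing model failed"""
print(extract_errors(sample))
```

Note that lines without the timestamp prefix, such as the "Unsupported ONNX data type: UINT8 (2)" line, are not matched by this pattern and would need to be searched for separately.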

When I run the command without the --fp16 option, I get the same error: Unsupported ONNX data type: UINT8 (2). Is there a solution to this problem?
Any help is much appreciated.
Kamal
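
The number in "UINT8 (2)" is the ONNX element-type code of the offending tensor. As a rough sketch, the graph inputs of a model can be checked against that code before calling trtexec. The names check_inputs and REJECTED_BY_PARSER are hypothetical, and only UINT8 is listed as rejected because that is the only type the log above confirms.

```python
# Subset of the ONNX TensorProto.DataType enum (code -> name)
ONNX_DTYPE_NAMES = {
    1: "FLOAT",
    2: "UINT8",
    3: "INT8",
    6: "INT32",
    7: "INT64",
    9: "BOOL",
    10: "FLOAT16",
}

# Assumption: only UINT8 is confirmed as rejected by the log above.
REJECTED_BY_PARSER = {"UINT8"}


def check_inputs(inputs):
    """Return names of inputs whose element type is in REJECTED_BY_PARSER.

    `inputs` is an iterable of (name, elem_type_code) pairs.
    """
    return [
        name
        for name, elem_type in inputs
        if ONNX_DTYPE_NAMES.get(elem_type, "UNKNOWN") in REJECTED_BY_PARSER
    ]


# Stand-in data; 2 is the code for UINT8.
print(check_inputs([("input_tensor", 2), ("some_float_input", 1)]))

# Usage with the onnx package (not executed here):
#   import onnx
#   model = onnx.load("ssdmobilenetv2_320new.onnx")
#   pairs = [(i.name, i.type.tensor_type.elem_type) for i in model.graph.input]
#   print(check_inputs(pairs))
```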

Hi Aastalll,
I tried changing the input dtype from uint8 to float32 using the script below:

import numpy as np
import onnx
import onnx_graphsurgeon as gs

print("Patching the ONNX model...")

# Load the shape-inferred model and set every graph input to float32
graph = gs.import_onnx(onnx.load("ssdmobilenetv2_320new.onnx"))
for inp in graph.inputs:
    inp.dtype = np.float32

# Export the patched graph back to an ONNX file
onnx.save(gs.export_onnx(graph), "updated_model.onnx")

print("Checking the patched model with onnx.checker...")
# check_model raises an exception if the model is invalid
model = onnx.load("updated_model.onnx")
onnx.checker.check_model(model)
print("The model is checked!")

With the newly generated model, I ran trtexec again: /usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx --fp16
The UINT8 error disappears, but a different error is thrown: Assertion failed: (inputs.at(1).is_weights()) && "This version of TensorRT only supports input K as an initializer."
See the log below:

/usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx --fp16
[06/21/2022-11:19:24] [I] === Model Options ===
[06/21/2022-11:19:24] [I] Format: ONNX
[06/21/2022-11:19:24] [I] Model: /home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx
[06/21/2022-11:19:24] [I] Output:
[06/21/2022-11:19:24] [I] === Build Options ===
[06/21/2022-11:19:24] [I] Max batch: explicit
[06/21/2022-11:19:24] [I] Workspace: 16 MiB
[06/21/2022-11:19:24] [I] minTiming: 1
[06/21/2022-11:19:24] [I] avgTiming: 8
[06/21/2022-11:19:24] [I] Precision: FP32+FP16
[06/21/2022-11:19:24] [I] Calibration: 
[06/21/2022-11:19:24] [I] Refit: Disabled
[06/21/2022-11:19:24] [I] Sparsity: Disabled
[06/21/2022-11:19:24] [I] Safe mode: Disabled
[06/21/2022-11:19:24] [I] Restricted mode: Disabled
[06/21/2022-11:19:24] [I] Save engine: 
[06/21/2022-11:19:24] [I] Load engine: 
[06/21/2022-11:19:24] [I] NVTX verbosity: 0
[06/21/2022-11:19:24] [I] Tactic sources: Using default tactic sources
[06/21/2022-11:19:24] [I] timingCacheMode: local
[06/21/2022-11:19:24] [I] timingCacheFile: 
[06/21/2022-11:19:24] [I] Input(s)s format: fp32:CHW
[06/21/2022-11:19:24] [I] Output(s)s format: fp32:CHW
[06/21/2022-11:19:24] [I] Input build shapes: model
[06/21/2022-11:19:24] [I] Input calibration shapes: model
[06/21/2022-11:19:24] [I] === System Options ===
[06/21/2022-11:19:24] [I] Device: 0
[06/21/2022-11:19:24] [I] DLACore: 
[06/21/2022-11:19:24] [I] Plugins:
[06/21/2022-11:19:24] [I] === Inference Options ===
[06/21/2022-11:19:24] [I] Batch: Explicit
[06/21/2022-11:19:24] [I] Input inference shapes: model
[06/21/2022-11:19:24] [I] Iterations: 10
[06/21/2022-11:19:24] [I] Duration: 3s (+ 200ms warm up)
[06/21/2022-11:19:24] [I] Sleep time: 0ms
[06/21/2022-11:19:24] [I] Streams: 1
[06/21/2022-11:19:24] [I] ExposeDMA: Disabled
[06/21/2022-11:19:24] [I] Data transfers: Enabled
[06/21/2022-11:19:24] [I] Spin-wait: Disabled
[06/21/2022-11:19:24] [I] Multithreading: Disabled
[06/21/2022-11:19:24] [I] CUDA Graph: Disabled
[06/21/2022-11:19:24] [I] Separate profiling: Disabled
[06/21/2022-11:19:24] [I] Time Deserialize: Disabled
[06/21/2022-11:19:24] [I] Time Refit: Disabled
[06/21/2022-11:19:24] [I] Skip inference: Disabled
[06/21/2022-11:19:24] [I] Inputs:
[06/21/2022-11:19:24] [I] === Reporting Options ===
[06/21/2022-11:19:24] [I] Verbose: Disabled
[06/21/2022-11:19:24] [I] Averages: 10 inferences
[06/21/2022-11:19:24] [I] Percentile: 99
[06/21/2022-11:19:24] [I] Dump refittable layers:Disabled
[06/21/2022-11:19:24] [I] Dump output: Disabled
[06/21/2022-11:19:24] [I] Profile: Disabled
[06/21/2022-11:19:24] [I] Export timing to JSON file: 
[06/21/2022-11:19:24] [I] Export output to JSON file: 
[06/21/2022-11:19:24] [I] Export profile to JSON file: 
[06/21/2022-11:19:24] [I] 
[06/21/2022-11:19:24] [I] === Device Information ===
[06/21/2022-11:19:24] [I] Selected Device: NVIDIA Tegra X1
[06/21/2022-11:19:24] [I] Compute Capability: 5.3
[06/21/2022-11:19:24] [I] SMs: 1
[06/21/2022-11:19:24] [I] Compute Clock Rate: 0.9216 GHz
[06/21/2022-11:19:24] [I] Device Global Memory: 3964 MiB
[06/21/2022-11:19:24] [I] Shared Memory per SM: 64 KiB
[06/21/2022-11:19:24] [I] Memory Bus Width: 64 bits (ECC disabled)
[06/21/2022-11:19:24] [I] Memory Clock Rate: 0.01275 GHz
[06/21/2022-11:19:24] [I] 
[06/21/2022-11:19:24] [I] TensorRT version: 8001
[06/21/2022-11:19:25] [I] [TRT] [MemUsageChange] Init CUDA: CPU +203, GPU +0, now: CPU 221, GPU 3787 (MiB)
[06/21/2022-11:19:25] [I] Start parsing network model
[06/21/2022-11:19:25] [I] [TRT] ----------------------------------------------------------------
[06/21/2022-11:19:25] [I] [TRT] Input filename:   /home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx
[06/21/2022-11:19:25] [I] [TRT] ONNX IR version:  0.0.8
[06/21/2022-11:19:25] [I] [TRT] Opset version:    13
[06/21/2022-11:19:25] [I] [TRT] Producer name:    tf2onnx
[06/21/2022-11:19:25] [I] [TRT] Producer version: 1.10.0 e9b6cb
[06/21/2022-11:19:25] [I] [TRT] Domain:           
[06/21/2022-11:19:25] [I] [TRT] Model version:    0
[06/21/2022-11:19:25] [I] [TRT] Doc string:       
[06/21/2022-11:19:25] [I] [TRT] ----------------------------------------------------------------
[06/21/2022-11:19:25] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[06/21/2022-11:19:25] [W] [TRT] onnx2trt_utils.cpp:390: One or more weights outside the range of INT32 was clamped
[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:720: While parsing node number 2922 [TopK -> "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/SortByField/TopKV2:0"]:
[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:721: --- Begin node ---
[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:722: input: "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/Concatenate/concat_1:0"
input: "Unsqueeze__4282:0"
output: "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/SortByField/TopKV2:0"
output: "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/SortByField/TopKV2:1"
name: "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/SortByField/TopKV2"
op_type: "TopK"
attribute {
  name: "sorted"
  i: 1
  type: INT
}

[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:723: --- End node ---
[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:726: ERROR: builtin_op_importers.cpp:4292 In function importTopK:
[8] Assertion failed: (inputs.at(1).is_weights()) && "This version of TensorRT only supports input K as an initializer."
[06/21/2022-11:20:11] [E] Failed to parse onnx file
[06/21/2022-11:20:11] [I] Finish parsing network model
[06/21/2022-11:20:11] [E] Parsing model failed
[06/21/2022-11:20:11] [E] Engine creation failed
[06/21/2022-11:20:11] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx --fp16
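
The assertion above concerns the second input (K) of a TopK node, which the parser expects to be a constant. As a minimal sketch, such nodes can be located by checking whether input index 1 is a constant. The helper find_dynamic_topk is hypothetical and works on any objects exposing op, name, and inputs; the stand-in nodes below exist only so the example runs without a model file.

```python
from types import SimpleNamespace


def find_dynamic_topk(nodes, is_constant):
    """Return names of TopK nodes whose K input (index 1) is not constant."""
    offenders = []
    for node in nodes:
        if node.op == "TopK" and len(node.inputs) > 1:
            if not is_constant(node.inputs[1]):
                offenders.append(node.name)
    return offenders


# Stand-in nodes: an int represents a constant K, a string a computed tensor.
fake_nodes = [
    SimpleNamespace(op="TopK", name="topk_dynamic", inputs=["scores", "k_tensor"]),
    SimpleNamespace(op="TopK", name="topk_const", inputs=["scores", 100]),
    SimpleNamespace(op="Conv", name="conv", inputs=["x", "w"]),
]
print(find_dynamic_topk(fake_nodes, lambda k: isinstance(k, int)))

# With onnx_graphsurgeon (not executed here):
#   import onnx
#   import onnx_graphsurgeon as gs
#   graph = gs.import_onnx(onnx.load("updated_model.onnx"))
#   print(find_dynamic_topk(graph.nodes, lambda k: isinstance(k, gs.Constant)))
```

This only inspects top-level nodes, not nodes inside subgraphs. Whether constant folding (for example onnx_graphsurgeon's graph.fold_constants()) would turn K into an initializer for this particular model is not established here.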

Any help is much appreciated.
Thanks in advance
Kamal