Describe the bug
I downloaded ssd_mobilenet_v2_320x320_coco17_tpu-8 from the TensorFlow 2 model zoo, converted the model to ONNX with tf2onnx (from the onnx GitHub organization), and then inferred shapes in the model using the symbolic_shape_infer.py script from the onnxruntime GitHub repository.
Running the two models for comparison, the original ONNX model with CUDAExecutionProvider and the shape-inferred one with both CUDAExecutionProvider and TensorrtExecutionProvider, I cannot see any improvement in performance (FPS) between the two. The output reports Unsupported ONNX data type: UINT8 (2) and throws many errors about the execution of MultiClassNonMaxSuppression with TensorrtExecutionProvider, see the screenshot below.
However, the model runs correctly, but only at 1.15 FPS.
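For reference, this is roughly how I compare the two models (a minimal sketch, assuming a dummy 320x320 uint8 input; the real measurement uses camera frames):

import time
import numpy as np
import onnxruntime as ort

def fps(model_path, providers, n_runs=50):
    sess = ort.InferenceSession(model_path, providers=providers)
    inp = sess.get_inputs()[0]
    # dummy 320x320 uint8 batch standing in for a camera frame
    data = np.random.randint(0, 255, size=(1, 320, 320, 3), dtype=np.uint8)
    sess.run(None, {inp.name: data})  # warm-up (the first TensorRT run builds the engine)
    start = time.time()
    for _ in range(n_runs):
        sess.run(None, {inp.name: data})
    return n_runs / (time.time() - start)

print("CUDA only:", fps("ssdmobilenetv2_320.onnx", ["CUDAExecutionProvider"]))
print("TRT+CUDA :", fps("ssdmobilenetv2_320new.onnx",
                        ["TensorrtExecutionProvider", "CUDAExecutionProvider"]))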
Is there another way to optimize the model to run faster on the Jetson Nano?
Any help is very much appreciated,
Thanks in advance
System information
Jetson Nano 4GB
Jetpack: 4.6
Linux Ubuntu 18.04.6 LTS
ONNXRuntime-GPU installed from source
ONNX Runtime version: 1.10.0
Python version: 3.6
CUDA/cuDNN version: 10.2.3 / 8.2.1.32
GPU model and memory: Tegra
TensorRT 8.0.1.6
To Reproduce
Convert the downloaded model from SavedModel to ONNX using the command below:
python3 -m tf2onnx.convert --saved-model ssd_mobilenet_v2_320x320_coco17_tpu-8/saved_model --output ssdmobilenetv2_320.onnx
Download the onnxruntime-gpu wheel:
wget https://nvidia.box.com/shared/static/jy7nqva7l88mq9i8bw3g3sklzf4kccn2.whl -O onnxruntime_gpu-1.11.0-cp36-cp36m-linux_aarch64.whl
Install the pip wheel (a quick sanity check of the available providers is shown after these steps):
sudo pip3 install onnxruntime_gpu-1.11.0-cp36-cp36m-linux_aarch64.whl
Install SymPy:
pip3 install sympy
Infer shapes in the model by running:
python3 symbolic_shape_infer.py --input ssdmobilenetv2_320.onnx --output ssdmobilenetv2_320new.onnx --auto_merge
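As a quick sanity check after installing the wheel (a minimal sketch; if the GPU providers are missing, ONNX Runtime silently falls back to CPU):

import onnxruntime as ort
# TensorrtExecutionProvider and CUDAExecutionProvider should both appear in this list
print(ort.get_available_providers())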
Expected behavior
Expect an improvement in performance (FPS) when running the model with TensorrtExecutionProvider.
When executing the command without the --fp16 option I get the same error, Unsupported ONNX data type: UINT8 (2). Is there a solution to this problem?
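For reference, the UINT8 type being reported can be confirmed directly from the graph inputs; a minimal check (assuming the shape-inferred model file) is:

import onnx
model = onnx.load("ssdmobilenetv2_320new.onnx")
for inp in model.graph.input:
    # elem_type 2 is TensorProto.UINT8, which TensorRT does not accept
    print(inp.name, onnx.TensorProto.DataType.Name(inp.type.tensor_type.elem_type))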
Any help is very much appreciated
Kamal
Hi AastaLLL,
I tried to change the input dtype from uint8 to float32 using the script below:
import onnx_graphsurgeon as gs
import onnx
import numpy as np

print("Patching the ONNX model...")
graph = gs.import_onnx(onnx.load("ssdmobilenetv2_320new.onnx"))
# change every graph input from uint8 to float32
for inp in graph.inputs:
    inp.dtype = np.float32
onnx.save(gs.export_onnx(graph), "updated_model.onnx")

print("Check the ONNX model with the checker function and see if it passes...")
model = onnx.load("updated_model.onnx")
onnx.checker.check_model(model)
print("The model is checked!")
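Note that after this patch the graph input is declared as float32, so the frame has to be cast before calling run; a minimal sketch (I assume the model's own preprocessing still handles the scaling internally, so the values are left in the 0-255 range):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("updated_model.onnx",
                            providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"])
inp = sess.get_inputs()[0]
frame = np.random.randint(0, 255, size=(1, 320, 320, 3), dtype=np.uint8)  # stand-in for a camera frame
# the input is now float32, so cast the uint8 frame before running the session
outputs = sess.run(None, {inp.name: frame.astype(np.float32)})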
Using the newly generated model I tried converting it with trtexec:
/usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx --fp16
The UINT8 error disappears, but it throws another error: Assertion failed: (inputs.at(1).is_weights()) && "This version of TensorRT only supports input K as an initializer."
See below:
/usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx --fp16
[06/21/2022-11:19:24] [I] === Model Options ===
[06/21/2022-11:19:24] [I] Format: ONNX
[06/21/2022-11:19:24] [I] Model: /home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx
[06/21/2022-11:19:24] [I] Output:
[06/21/2022-11:19:24] [I] === Build Options ===
[06/21/2022-11:19:24] [I] Max batch: explicit
[06/21/2022-11:19:24] [I] Workspace: 16 MiB
[06/21/2022-11:19:24] [I] minTiming: 1
[06/21/2022-11:19:24] [I] avgTiming: 8
[06/21/2022-11:19:24] [I] Precision: FP32+FP16
[06/21/2022-11:19:24] [I] Calibration:
[06/21/2022-11:19:24] [I] Refit: Disabled
[06/21/2022-11:19:24] [I] Sparsity: Disabled
[06/21/2022-11:19:24] [I] Safe mode: Disabled
[06/21/2022-11:19:24] [I] Restricted mode: Disabled
[06/21/2022-11:19:24] [I] Save engine:
[06/21/2022-11:19:24] [I] Load engine:
[06/21/2022-11:19:24] [I] NVTX verbosity: 0
[06/21/2022-11:19:24] [I] Tactic sources: Using default tactic sources
[06/21/2022-11:19:24] [I] timingCacheMode: local
[06/21/2022-11:19:24] [I] timingCacheFile:
[06/21/2022-11:19:24] [I] Input(s)s format: fp32:CHW
[06/21/2022-11:19:24] [I] Output(s)s format: fp32:CHW
[06/21/2022-11:19:24] [I] Input build shapes: model
[06/21/2022-11:19:24] [I] Input calibration shapes: model
[06/21/2022-11:19:24] [I] === System Options ===
[06/21/2022-11:19:24] [I] Device: 0
[06/21/2022-11:19:24] [I] DLACore:
[06/21/2022-11:19:24] [I] Plugins:
[06/21/2022-11:19:24] [I] === Inference Options ===
[06/21/2022-11:19:24] [I] Batch: Explicit
[06/21/2022-11:19:24] [I] Input inference shapes: model
[06/21/2022-11:19:24] [I] Iterations: 10
[06/21/2022-11:19:24] [I] Duration: 3s (+ 200ms warm up)
[06/21/2022-11:19:24] [I] Sleep time: 0ms
[06/21/2022-11:19:24] [I] Streams: 1
[06/21/2022-11:19:24] [I] ExposeDMA: Disabled
[06/21/2022-11:19:24] [I] Data transfers: Enabled
[06/21/2022-11:19:24] [I] Spin-wait: Disabled
[06/21/2022-11:19:24] [I] Multithreading: Disabled
[06/21/2022-11:19:24] [I] CUDA Graph: Disabled
[06/21/2022-11:19:24] [I] Separate profiling: Disabled
[06/21/2022-11:19:24] [I] Time Deserialize: Disabled
[06/21/2022-11:19:24] [I] Time Refit: Disabled
[06/21/2022-11:19:24] [I] Skip inference: Disabled
[06/21/2022-11:19:24] [I] Inputs:
[06/21/2022-11:19:24] [I] === Reporting Options ===
[06/21/2022-11:19:24] [I] Verbose: Disabled
[06/21/2022-11:19:24] [I] Averages: 10 inferences
[06/21/2022-11:19:24] [I] Percentile: 99
[06/21/2022-11:19:24] [I] Dump refittable layers:Disabled
[06/21/2022-11:19:24] [I] Dump output: Disabled
[06/21/2022-11:19:24] [I] Profile: Disabled
[06/21/2022-11:19:24] [I] Export timing to JSON file:
[06/21/2022-11:19:24] [I] Export output to JSON file:
[06/21/2022-11:19:24] [I] Export profile to JSON file:
[06/21/2022-11:19:24] [I]
[06/21/2022-11:19:24] [I] === Device Information ===
[06/21/2022-11:19:24] [I] Selected Device: NVIDIA Tegra X1
[06/21/2022-11:19:24] [I] Compute Capability: 5.3
[06/21/2022-11:19:24] [I] SMs: 1
[06/21/2022-11:19:24] [I] Compute Clock Rate: 0.9216 GHz
[06/21/2022-11:19:24] [I] Device Global Memory: 3964 MiB
[06/21/2022-11:19:24] [I] Shared Memory per SM: 64 KiB
[06/21/2022-11:19:24] [I] Memory Bus Width: 64 bits (ECC disabled)
[06/21/2022-11:19:24] [I] Memory Clock Rate: 0.01275 GHz
[06/21/2022-11:19:24] [I]
[06/21/2022-11:19:24] [I] TensorRT version: 8001
[06/21/2022-11:19:25] [I] [TRT] [MemUsageChange] Init CUDA: CPU +203, GPU +0, now: CPU 221, GPU 3787 (MiB)
[06/21/2022-11:19:25] [I] Start parsing network model
[06/21/2022-11:19:25] [I] [TRT] ----------------------------------------------------------------
[06/21/2022-11:19:25] [I] [TRT] Input filename: /home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx
[06/21/2022-11:19:25] [I] [TRT] ONNX IR version: 0.0.8
[06/21/2022-11:19:25] [I] [TRT] Opset version: 13
[06/21/2022-11:19:25] [I] [TRT] Producer name: tf2onnx
[06/21/2022-11:19:25] [I] [TRT] Producer version: 1.10.0 e9b6cb
[06/21/2022-11:19:25] [I] [TRT] Domain:
[06/21/2022-11:19:25] [I] [TRT] Model version: 0
[06/21/2022-11:19:25] [I] [TRT] Doc string:
[06/21/2022-11:19:25] [I] [TRT] ----------------------------------------------------------------
[06/21/2022-11:19:25] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[06/21/2022-11:19:25] [W] [TRT] onnx2trt_utils.cpp:390: One or more weights outside the range of INT32 was clamped
[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:720: While parsing node number 2922 [TopK -> "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/SortByField/TopKV2:0"]:
[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:721: --- Begin node ---
[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:722: input: "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/Concatenate/concat_1:0"
input: "Unsqueeze__4282:0"
output: "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/SortByField/TopKV2:0"
output: "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/SortByField/TopKV2:1"
name: "StatefulPartitionedCall/Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/SortByField/TopKV2"
op_type: "TopK"
attribute {
name: "sorted"
i: 1
type: INT
}
[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:723: --- End node ---
[06/21/2022-11:20:11] [E] [TRT] ModelImporter.cpp:726: ERROR: builtin_op_importers.cpp:4292 In function importTopK:
[8] Assertion failed: (inputs.at(1).is_weights()) && "This version of TensorRT only supports input K as an initializer."
[06/21/2022-11:20:11] [E] Failed to parse onnx file
[06/21/2022-11:20:11] [I] Finish parsing network model
[06/21/2022-11:20:11] [E] Parsing model failed
[06/21/2022-11:20:11] [E] Engine creation failed
[06/21/2022-11:20:11] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=/home/kamal/Desktop/Desktopnanoproject/ONNXRuntime/updated_model.onnx --fp16
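I am also considering folding constants with onnx-graphsurgeon, so that the Unsqueeze feeding K into TopK might become an initializer (which is what this TensorRT version expects); I have not verified that this works for this model, since K may be data-dependent. A sketch of what I mean:

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("updated_model.onnx"))
# fold subgraphs that evaluate to constants and drop the dead nodes;
# if the K input of TopK becomes constant it is exported as an initializer
graph.fold_constants()
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "updated_model_folded.onnx")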
Any help is very much appreciated
Thanks in advance
Kamal
The error is related to some unsupported layers.
Is it possible to upgrade your device to JetPack 4.6.2 with TensorRT 8.2?
If yes, could you try the sample below, which deploys the ssd_mobilenet_v2_320x320_coco17_tpu-8 model with TensorRT?
Hi AastaLLL,
Thank you for your reply.
It is correct that to build the TensorRT engine you have to upgrade your JetPack to 4.6.1 or 4.6.2, since they come with TensorRT 8.2.1. However, re-exporting the TensorFlow model or creating the ONNX model cannot be done on the Jetson, because TensorFlow 2.5 cannot be installed on JetPack 4.6.1 and 4.6.2. Will NVIDIA provide a TensorFlow 2.5 wheel for Python 3.6 on Linux ARM for these JetPack versions? https://developer.download.nvidia.com/compute/redist/jp/v461/tensorflow/
Thanks in advance