Performance DECREASE with TensorRT under ONNX Runtime, pt. 2

Hi All,

I originally posted this back in January, but when a response came in March I wasn’t in a position to update or respond. Now I am!

I’m working on deploying an ONNX-format image classifier model (Inception) on a Jetson Xavier AGX (with JetPack 4.6). I’ve gotten it to work with onnxruntime in a Docker container using both the CUDAExecutionProvider and the TensorrtExecutionProvider.

I was expecting a speed-up from using TensorRT with my models. Instead I’m seeing a significant (15x) slowdown. What am I missing?

The timings below show the seconds it took to run an inception_v4 model on 100 images with the CUDAExecutionProvider and the TensorrtExecutionProvider, respectively (a simplified sketch of the timing loop is included below). The models were trained and converted to ONNX with PyTorch on a different computer (they can be provided on request), and the runs were executed through Docker on the Jetson AGX in MAXN mode.
Watching jtop, I can see that with the CUDAExecutionProvider the GPU is always fully engaged, while with the TensorrtExecutionProvider the GPU is only intermittently engaged, like it’s sputtering.

Provider    inception_v4 (100 images)
CUDA         16 s
TRT         254 s

So the best speed I’m getting is ~6 img/sec (with CUDA). Shouldn’t I be able to crank out more frames per second? What’s holding up the processing?
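To make it concrete, “using CUDAExecutionProvider and TensorrtExecutionProvider respectively” boils down to something like the simplified sketch below (not the actual script from the zip; the model path and dummy inputs are placeholders):

import time
import numpy as np
import onnxruntime as ort

# Placeholder model path and dummy inputs standing in for the real image batch.
MODEL = "models/inception_v4.onnx"
images = [np.random.rand(1, 3, 299, 299).astype(np.float32) for _ in range(100)]

for providers in (["CUDAExecutionProvider"],
                  ["TensorrtExecutionProvider", "CUDAExecutionProvider"]):
    sess = ort.InferenceSession(MODEL, providers=providers)
    input_name = sess.get_inputs()[0].name

    start = time.time()
    for img in images:
        sess.run(None, {input_name: img})
    print(providers[0], "%.1f s for %d images" % (time.time() - start, len(images)))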

I’ve attached a zip of the project directory. NOT included in the zip are the Jetson Zoo onnxruntime_gpu-1.11.0-cp36-cp36m-linux_aarch64 wheel and the image data (“app/testdata/testset/” from run_job.sh).

The classifier can be forced to use CUDA or TRT by editing the run_job.sh file.
If you need anything else to dig into the specifics, let me know!
Thanks for your help!

jetsonproject_2022-04-20.zip (72.9 MB)


Hi,

Thanks for reporting this.

We are going to reproduce this issue internally.
Will share more information with you later.

Thanks.

Hi,

The latency comes from TensorRT converting the ONNX model into a TRT engine at startup.
If you serialize the engine file, the conversion doesn’t need to be repeated the next time.

We can get much better performance by loading the engine file directly.
Please note that you can also use fp16 or int8 mode to optimize the inference further.

For example:
https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html

$ export ORT_TENSORRT_ENGINE_CACHE_ENABLE=1
$ export ORT_TENSORRT_CACHE_PATH=./engine
$ python3 main.py models/20220115_Jan2022_NES21__1_iv4.FP16.onnx data/ --classfile models/20220115_Jan2022_NES21__1_iv4.FP16.classes --mode trt
2022-04-22 13:57:08.278362090 [W:onnxruntime:Default, tensorrt_execution_provider.h:53 log] [2022-04-22 05:57:08 WARNING] /home/onnxruntime/onnxruntime-py36/cmake/external/onnx-tensorrt/onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
['Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii']
Final Summary: 10 imgs in 6.2053 secs IE 1.6 img/s
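Depending on the onnxruntime-gpu build, the cache (and fp16) settings can also be passed as TensorRT execution provider options when the session is created. A rough sketch is below; these option names come from newer ONNX Runtime documentation and may not be supported by the 1.11 Jetson wheel, in which case please keep using the environment variables above:

import onnxruntime as ort

trt_options = {
    "trt_engine_cache_enable": True,      # serialize the built engine ...
    "trt_engine_cache_path": "./engine",  # ... and reload it on later runs
    "trt_fp16_enable": True,              # optional: enable fp16 mode
}

sess = ort.InferenceSession(
    "models/20220115_Jan2022_NES21__1_iv4.FP16.onnx",
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",  # fallback for nodes TensorRT cannot run
    ],
)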

Thanks.

Hi AastaLLL,

This solved my issue! Using a pre-built engine almost doubles the images per second compared to using CUDA.

Thank you for your guidance!

@AastaLLL Thanks for helping us with this. The use of the cached engine has improved our inference throughput.

However, we are still seeing that ONNXRuntime with the TensorRT execution provider is performing much worse than using TensorRT directly (i.e., when benchmarked via the trtexec or polygraphy tools) on the Jetson Xavier AGX.
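For reference, the direct-TensorRT numbers come from commands roughly along these lines (paths simplified, not our exact invocations):

$ /usr/src/tensorrt/bin/trtexec --onnx=inception_v3.onnx --saveEngine=inception_v3.engine
$ /usr/src/tensorrt/bin/trtexec --loadEngine=inception_v3.engine --iterations=100
$ polygraphy run inception_v3.onnx --trt --onnxrt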

When I compare the TRT engines produced by invoking TensorRT directly versus via ONNX Runtime, I find that they have very similar performance.

Yet when I compare ONNXRT+TRT vs TRT directly, with pre-built engines, I see a large performance gap, 30% or more.

Therefore I suspect the performance loss is caused by the ONNX Runtime API itself adding overhead. Has your team ever seen this? Is there any way to figure out what part of the pipeline is causing it?
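In case it helps narrow this down, ONNX Runtime’s built-in profiler should show where the session spends its time per run; a minimal sketch (file and input names are placeholders):

import numpy as np
import onnxruntime as ort

so = ort.SessionOptions()
so.enable_profiling = True  # write a Chrome-trace JSON with per-run / per-node timings

sess = ort.InferenceSession(
    "inception_v3.onnx",
    sess_options=so,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)

x = np.random.rand(1, 3, 299, 299).astype(np.float32)
input_name = sess.get_inputs()[0].name
for _ in range(100):
    sess.run(None, {input_name: x})

print("profile written to:", sess.end_profiling())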

The model I am testing with is a standard Inception v3 model in ONNX format (exported with PyTorch). All testing was done using the polygraphy tool.

I filed an issue describing our setup here: Lower performance on Inceptionv3/4 model with TensorRT EP than TensorRT directly · Issue #11356 · microsoft/onnxruntime · GitHub
