Performance DECREASE with TensorRT under ONNX Runtime, pt. 2

Hi All,

I originally posted this back in January, but when a response came in March I wasn’t in a position to update or respond. Now I am!

I’m working on deploying an ONNX-format image classifier model (Inception) on a Jetson Xavier AGX (with JetPack 4.6). I’ve gotten it to work with onnxruntime in a Docker container using both the CUDAExecutionProvider and the TensorrtExecutionProvider.

I was expecting a speed-up from using TensorRT with my models. Instead I’m seeing a significant (15x) slowdown. What am I missing?

The timings below show the seconds it took to run an inception_v4 model on 100 images with the CUDAExecutionProvider and the TensorrtExecutionProvider, respectively (a simplified sketch of the timing loop is included below). The models were trained and converted to ONNX with PyTorch on a different computer (they can be provided on request), and the runs were executed through Docker on the Jetson AGX in MAXN mode.
Watching jtop, I can see that with the CUDAExecutionProvider the GPU is always fully engaged, while with the TensorrtExecutionProvider the GPU is only intermittently engaged, like it’s sputtering.

Provider    inception_v4 (100 images)
CUDA         16 s
TRT         254 s

So the best speed I’m getting is ~6 img/sec (with CUDA). Shouldn’t I be able to crank out more frames per second? What’s holding up the processing?
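To make it concrete, “using CUDAExecutionProvider and TensorrtExecutionProvider respectively” boils down to something like the simplified sketch below (not the actual script from the zip; the model path and dummy inputs are placeholders):

import time
import numpy as np
import onnxruntime as ort

# Placeholder model path and dummy inputs standing in for the real image batch.
MODEL = "models/inception_v4.onnx"
images = [np.random.rand(1, 3, 299, 299).astype(np.float32) for _ in range(100)]

for providers in (["CUDAExecutionProvider"],
                  ["TensorrtExecutionProvider", "CUDAExecutionProvider"]):
    sess = ort.InferenceSession(MODEL, providers=providers)
    input_name = sess.get_inputs()[0].name

    start = time.time()
    for img in images:
        sess.run(None, {input_name: img})
    print(providers[0], "%.1f s for %d images" % (time.time() - start, len(images)))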

I’ve attached a zip of the project directory. NOT included in the zip are the Jetson Zoo onnxruntime_gpu-1.11.0-cp36-cp36m-linux_aarch64 wheel and the image data (“app/testdata/testset/” from run_job.sh).

The classifier can be forced to use CUDA or TRT by editing the run_job.sh file.
If you need anything else to dig into the specifics, let me know!
Thanks for your help!

jetsonproject_2022-04-20.zip (72.9 MB)


Hi,

Thanks for reporting this.

We are going to reproduce this issue internally.
Will share more information with you later.

Thanks.

Hi,

The latency comes from TensorRT converting the ONNX model into a TRT engine at startup.
If you serialize the engine file, the conversion doesn’t need to be repeated the next time.

We can get much better performance by loading the engine file directly.
Please note that you can also use fp16 or int8 mode to optimize the inference further.

For example:
https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html

$ export ORT_TENSORRT_ENGINE_CACHE_ENABLE=1
$ export ORT_TENSORRT_CACHE_PATH=./engine
$ python3 main.py models/20220115_Jan2022_NES21__1_iv4.FP16.onnx data/ --classfile models/20220115_Jan2022_NES21__1_iv4.FP16.classes --mode trt
2022-04-22 13:57:08.278362090 [W:onnxruntime:Default, tensorrt_execution_provider.h:53 log] [2022-04-22 05:57:08 WARNING] /home/onnxruntime/onnxruntime-py36/cmake/external/onnx-tensorrt/onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
['Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii', 'Copepod_nauplii']
Final Summary: 10 imgs in 6.2053 secs IE 1.6 img/s
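Depending on the onnxruntime-gpu build, the cache (and fp16) settings can also be passed as TensorRT execution provider options when the session is created. A rough sketch is below; these option names come from newer ONNX Runtime documentation and may not be supported by the 1.11 Jetson wheel, in which case please keep using the environment variables above:

import onnxruntime as ort

trt_options = {
    "trt_engine_cache_enable": True,      # serialize the built engine ...
    "trt_engine_cache_path": "./engine",  # ... and reload it on later runs
    "trt_fp16_enable": True,              # optional: enable fp16 mode
}

sess = ort.InferenceSession(
    "models/20220115_Jan2022_NES21__1_iv4.FP16.onnx",
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",  # fallback for nodes TensorRT cannot run
    ],
)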

Thanks.

Hi AastaLLL,

This solved my issue! Using a pre-built engine almost doubles the images per second compared to using CUDA.

Thank you for your guidance!

@AastaLLL Thanks for helping us with this. The use of the cached engine has improved our inference throughput.

However, we are still seeing that ONNXRuntime with the TensorRT execution provider is performing much worse than using TensorRT directly (i.e., when benchmarked via the trtexec or polygraphy tools) on the Jetson Xavier AGX.
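For reference, the direct-TensorRT numbers come from commands roughly along these lines (paths simplified, not our exact invocations):

$ /usr/src/tensorrt/bin/trtexec --onnx=inception_v3.onnx --saveEngine=inception_v3.engine
$ /usr/src/tensorrt/bin/trtexec --loadEngine=inception_v3.engine --iterations=100
$ polygraphy run inception_v3.onnx --trt --onnxrt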

When I compare the TRT engines produced by invoking TensorRT directly versus via ONNX Runtime, I find that they have very similar performance.

Yet when I compare ONNXRT+TRT vs TRT directly, with pre-built engines, I see a large performance gap, 30% or more.

Therefore I suspect the performance loss is caused by the ONNX Runtime API itself adding overhead. Has your team ever seen this? Is there any way to figure out what part of the pipeline is causing it?
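In case it helps narrow this down, ONNX Runtime’s built-in profiler should show where the session spends its time per run; a minimal sketch (file and input names are placeholders):

import numpy as np
import onnxruntime as ort

so = ort.SessionOptions()
so.enable_profiling = True  # write a Chrome-trace JSON with per-run / per-node timings

sess = ort.InferenceSession(
    "inception_v3.onnx",
    sess_options=so,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)

x = np.random.rand(1, 3, 299, 299).astype(np.float32)
input_name = sess.get_inputs()[0].name
for _ in range(100):
    sess.run(None, {input_name: x})

print("profile written to:", sess.end_profiling())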

The model I am testing with is a standard Inception v3 model in ONNX format (exported with PyTorch). All testing was done using the polygraphy tool.

I filed an issue describing our setup here: Lower performance on Inceptionv3/4 model with TensorRT EP than TensorRT directly · Issue #11356 · microsoft/onnxruntime · GitHub
