TensorRT Inference Slower When Loading Serialized Engine Than Building on the Fly

Description

When running the sample /usr/src/tensorrt/samples/python/uff_ssd/, inference is much faster when the engine is built on the fly and then used immediately (23 ms). However, when I run the sample again after that first run, so that the engine that was built, serialized, and saved is now loaded and deserialized, inference is much slower (1100 ms).

The bounding box outputs are exactly the same.

I would like to have a saved engine that can be loaded rather than having to build it every time. Is there a reason building on the fly would give faster inference than loading an engine file?

What can I do to solve this?

Thanks.

Environment

Jetson AGX Xavier [16GB] - Jetpack 4.4.1 [L4T 32.4.4]

TensorRT Version: 7.1.3.0
Python Version: 3.6.9
TensorFlow Version: 1.15.2+nv20.6

Hi,
Can you try running your model with the trtexec command, and share the --verbose log in case the issue persists?
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

You can refer to the link below for the list of supported operators; in case any operator is not supported, you need to create a custom plugin to support that operation.

Also, please share your model and script if you have not already, so that we can help you better.

Thanks!

@NVES Thanks for the response. Running with trtexec gives me the low-latency output I expect (see attached trtExecOutput.txt).

This shows me that the saved engine is correct, so maybe it's an issue with the Python API?

I am referring to: /usr/src/tensorrt/samples/python/uff_ssd/detect_objects.py

When I build the engine on the fly with this code:

[screenshot: engine-building code]
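
Roughly, the build step looks like the following sketch (TensorRT 7.x UFF path; the input/output node names and shape are assumptions for an SSD-style model, not copied from the sample):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(uff_model_path):
    # Build a TensorRT engine from a UFF model (implicit batch mode).
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network() as network, \
         trt.UffParser() as parser:
        builder.max_workspace_size = 1 << 30  # 1 GiB workspace
        builder.max_batch_size = 1
        # Node names/shape below are assumptions for an SSD-like model.
        parser.register_input("Input", (3, 300, 300))
        parser.register_output("NMS")
        parser.parse(uff_model_path, network)
        return builder.build_cuda_engine(network)
```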

And time the inference with this code:

[screenshot: inference timing code]
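
The timing is essentially this sketch (`inference_wrapper` and `image_path` are placeholders, not the sample's exact names):

```python
import time

def time_single_inference(inference_wrapper, image_path):
    # Time one call to the inference wrapper (placeholder object).
    start = time.time()
    detection_out, keep_count_out = inference_wrapper.infer(image_path)
    elapsed_ms = (time.time() - start) * 1000
    print("Inference time: {:.1f} ms".format(elapsed_ms))
    return detection_out, keep_count_out
```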

I get this output (inference time as expected):

[screenshot: output showing the expected inference time]

I have tried saving the engine like this (Configuration 1):

[screenshot: engine-saving code, Configuration 1]

And like this (Configuration 2):

[screenshot: engine-saving code, Configuration 2]
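
Both saving configurations boil down to something like this sketch (serialize the engine and write the bytes to disk; the file path is arbitrary):

```python
def save_engine(engine, path="ssd_engine.buf"):
    # engine.serialize() returns an IHostMemory buffer that can be
    # written directly to a file.
    with open(path, "wb") as f:
        f.write(engine.serialize())
```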

With loading the engine like this (Configuration 1):

And like this (Configuration 2):
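
Both loading configurations are essentially this sketch (deserialize the engine file with a TensorRT runtime; names are placeholders):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(path="ssd_engine.buf"):
    # Read the serialized engine back and deserialize it into an ICudaEngine.
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())
```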

In both configurations I get the same output (inference time slower than expected):

I understand this Python timing code is not as accurate as the timing done by trtexec, but the difference is large enough that I suspect something is up.

Do you have any further thoughts?

Thanks!

trtExecOutput.txt (6.8 KB)

Hi @bk4,

Could you please confirm how many times the inference is run when collecting the data?

Thank you.

Hi @spolisetty ,

I have confirmed the inference is run once.

Thanks.

Hi @bk4,

The issue is probably library startup time (mostly loading SASS onto the GPU).
If you run the builder first and only then start the timer, that startup time is absorbed by the build and is not observed in the measurement.
Is the application a one-shot inference? If not, then you need to time the second invocation.
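
For example, one way to exclude that one-time startup cost is to run an untimed warm-up inference before measuring (a minimal sketch; the wrapper and path names are placeholders):

```python
import time

def time_steady_state_inference(inference_wrapper, image_path):
    # Untimed warm-up: the first call after deserializing the engine pays the
    # one-time CUDA/library startup cost (e.g. loading SASS onto the GPU).
    inference_wrapper.infer(image_path)

    # Time a subsequent invocation to get steady-state latency.
    start = time.time()
    inference_wrapper.infer(image_path)
    print("Inference time: {:.1f} ms".format((time.time() - start) * 1000))
```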

Thank you.


Thanks @spolisetty !

That was indeed the issue. The first inference was including the library startup time in its measurement; all subsequent inferences had the expected timing.

Thanks
