TensorRT Inference Slower When Loading Serialized Engine Than Building on the Fly

Description

When running the sample /usr/src/tensorrt/samples/python/uff_ssd/, inference is much faster when the engine is built on the fly and then used immediately (23 ms). However, when I run the sample again after that first run, so that the engine that was built, serialized, and saved is now loaded and deserialized, inference is much slower (1100 ms).

The bounding box outputs are exactly the same.

I would like to have a saved engine that can be loaded rather than having to build it every time. Is there a reason building on the fly would give faster inference than loading an engine file?

What can I do to solve this?

Thanks.

Environment

Jetson AGX Xavier [16GB] - Jetpack 4.4.1 [L4T 32.4.4]

TensorRT Version: 7.1.3.0
Python Version: 3.6.9
TensorFlow Version: 1.15.2+nv20.6

Hi,
Can you try running your model with the trtexec command, and share the --verbose log in case the issue persists?
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

You can refer to the link below for the list of supported operators; in case any operator is not supported, you need to create a custom plugin to support that operation.

Also, please share your model and script if you have not already, so that we can help you better.

Thanks!

@NVES Thanks for the response. Running with trtexec gives me the low-latency output I expect (see attached trtExecOutput.txt).

This shows me that the saved engine is correct, so maybe it's an issue with the Python API?

I am referring to: /usr/src/tensorrt/samples/python/uff_ssd/detect_objects.py

When I build the engine on the fly with this code:

[screenshot: engine-building code]
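
Roughly, the build step looks like the following sketch (TensorRT 7.x UFF path; the input/output node names and shape are assumptions for an SSD-style model, not copied from the sample):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(uff_model_path):
    # Build a TensorRT engine from a UFF model (implicit batch mode).
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network() as network, \
         trt.UffParser() as parser:
        builder.max_workspace_size = 1 << 30  # 1 GiB workspace
        builder.max_batch_size = 1
        # Node names/shape below are assumptions for an SSD-like model.
        parser.register_input("Input", (3, 300, 300))
        parser.register_output("NMS")
        parser.parse(uff_model_path, network)
        return builder.build_cuda_engine(network)
```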

And time the inference with this code:

[screenshot: inference timing code]
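
The timing is essentially this sketch (`inference_wrapper` and `image_path` are placeholders, not the sample's exact names):

```python
import time

def time_single_inference(inference_wrapper, image_path):
    # Time one call to the inference wrapper (placeholder object).
    start = time.time()
    detection_out, keep_count_out = inference_wrapper.infer(image_path)
    elapsed_ms = (time.time() - start) * 1000
    print("Inference time: {:.1f} ms".format(elapsed_ms))
    return detection_out, keep_count_out
```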

I get this output (inference time as expected):

[screenshot: output showing the expected inference time]

I have tried saving the engine like this (Configuration 1):

[screenshot: engine-saving code, Configuration 1]

And like this (Configuration 2):

[screenshot: engine-saving code, Configuration 2]
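
Both saving configurations boil down to something like this sketch (serialize the engine and write the bytes to disk; the file path is arbitrary):

```python
def save_engine(engine, path="ssd_engine.buf"):
    # engine.serialize() returns an IHostMemory buffer that can be
    # written directly to a file.
    with open(path, "wb") as f:
        f.write(engine.serialize())
```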

With loading the engine like this (Configuration 1):

And like this (Configuration 2):
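
Both loading configurations are essentially this sketch (deserialize the engine file with a TensorRT runtime; names are placeholders):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(path="ssd_engine.buf"):
    # Read the serialized engine back and deserialize it into an ICudaEngine.
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())
```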

In both configurations I get the same output (inference time slower than expected):

I understand this Python timing code is not as accurate as the timing done by trtexec, but the difference is large enough that I suspect something is up.

Do you have any further thoughts?

Thanks!

trtExecOutput.txt (6.8 KB)

Hi @bk4,

Could you please confirm how many times the inference is run when collecting the data?

Thank you.

Hi @spolisetty ,

I have confirmed the inference is run once.

Thanks.

Hi @bk4,

The issue is probably library startup time (mostly loading SASS onto the GPU).
If you run the builder first and only then start the timer, that startup time is absorbed by the build and is not observed in the measurement.
Is the application a one-shot inference? If not, then you need to time the second invocation.
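
For example, one way to exclude that one-time startup cost is to run an untimed warm-up inference before measuring (a minimal sketch; the wrapper and path names are placeholders):

```python
import time

def time_steady_state_inference(inference_wrapper, image_path):
    # Untimed warm-up: the first call after deserializing the engine pays the
    # one-time CUDA/library startup cost (e.g. loading SASS onto the GPU).
    inference_wrapper.infer(image_path)

    # Time a subsequent invocation to get steady-state latency.
    start = time.time()
    inference_wrapper.infer(image_path)
    print("Inference time: {:.1f} ms".format((time.time() - start) * 1000))
```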

Thank you.


Thanks @spolisetty !

That was indeed the issue. The first inference was including the library startup time in its measurement; all subsequent inferences had the expected timing.

Thanks
