Slow first inference and very slow inference when running two models

Description

I converted two models with TF-TRT to TRT FP32 and TRT FP16, and I see a good speedup in inference time.
Having said that, I have two problems:

  1. The first inference takes a long time (about 30 s for one model and 90 s for the other), which is too long for my application. Is this a known TensorRT behavior?
    The time is spent specifically in this line:
    pred = infer(batch)['tf.math.sigmoid']

    Is it possible to serialize a model in such a way that this cost is avoided, given that after

    model = tf.saved_model.load(model_path, tags=[tag_constants.SERVING])
    infer = model.signatures['serving_default']

    TRT apparently still has to perform some optimizations before the first inference? (See the first sketch after this list.)

  2. When I run the two models together in the same loop (predict with one, then predict with the other), just to check whether using two models together runs slowly, I see very slow inference times for both models.
    Some background: my application predicts on an image with the first model and then runs a few predictions on the first model’s outputs with the second model (see the second sketch after this list for the simplified loop).
    Doing that with the two TF-TRT models resulted in a dramatic increase in inference time.
    Any ideas on why this happens and how I should approach it (other than creating a new architecture that performs both stages in a single model)?
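
For reference, this is roughly how I convert and save the models (the paths, input shape, and input function below are placeholders, not my exact code). Would calling converter.build() with a representative input before converter.save() move the engine building to conversion time and cut the first-inference delay?

import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

SAVED_MODEL_DIR = "saved_model"        # placeholder path
OUTPUT_DIR = "saved_model_trt_fp16"    # placeholder path

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=SAVED_MODEL_DIR,
    precision_mode="FP16",
)
converter.convert()

# Feed one or more representative batches so the TRT engines are built
# here, at conversion time, instead of lazily on the first inference.
def representative_input_fn():
    yield (np.random.random((1, 512, 512, 3)).astype(np.float32),)

converter.build(input_fn=representative_input_fn)
converter.save(OUTPUT_DIR)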
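
And this is, simplified, the two-model loop I used for the check above (paths, shapes, and the cropping step are placeholders standing in for my real pipeline):

import numpy as np
import tensorflow as tf
from tensorflow.python.saved_model import tag_constants

model_a = tf.saved_model.load("model_a_trt_fp16", tags=[tag_constants.SERVING])  # placeholder path
model_b = tf.saved_model.load("model_b_trt_fp16", tags=[tag_constants.SERVING])  # placeholder path
infer_a = model_a.signatures['serving_default']
infer_b = model_b.signatures['serving_default']

image = tf.constant(np.random.random((1, 512, 512, 3)).astype(np.float32))  # placeholder input

for _ in range(100):
    # first stage: predict on the full image
    mask = infer_a(image)['tf.math.sigmoid']
    # second stage: a few predictions on regions taken from the first model's output
    crops = [mask[:, :256, :256, :], mask[:, 256:, 256:, :]]  # stand-in for the real cropping logic
    for crop in crops:
        _ = infer_b(crop)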

Environment

TensorRT Version: 8.2.5.1
GPU Type: RTX 3060 (Laptop)
Nvidia Driver Version: 515
CUDA Version: 11.7 (reported by nvcc --version)
CUDNN Version:
Operating System + Version: Ubuntu 20.04
Python Version (if applicable): 3.8.10
TensorFlow Version (if applicable): 2.9.1
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorflow:22.06-tf2-py3

Hi,

Please share the model, script, profiler, and performance output (if not already shared) so that we can help you better.

Alternatively, you can try running your model with the trtexec command.

While measuring model performance, make sure you consider the latency and throughput of the network inference itself, excluding the data pre- and post-processing overhead (see the timing sketch after the links below).
Please refer to the following links for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#measure-performance

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#model-accuracy
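
For example, a simple timing loop along the following lines (the model path, input shape, and iteration counts are placeholders) skips a few warm-up iterations and then measures only the inference calls:

import time
import numpy as np
import tensorflow as tf
from tensorflow.python.saved_model import tag_constants

model = tf.saved_model.load("saved_model_trt_fp16", tags=[tag_constants.SERVING])  # placeholder path
infer = model.signatures['serving_default']

batch = tf.constant(np.random.random((1, 512, 512, 3)).astype(np.float32))  # placeholder input

# Warm up so engine building / initialization is not counted.
for _ in range(10):
    infer(batch)

n = 100
start = time.perf_counter()
for _ in range(n):
    out = infer(batch)
# pull the outputs back to the host to make sure all pending work has finished
_ = [t.numpy() for t in out.values()]
elapsed = time.perf_counter() - start

print(f"mean latency: {elapsed / n * 1000:.2f} ms, throughput: {n / elapsed:.1f} inferences/s")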

Thanks!

My application has a visual display of the results, and using the two TRT models is visibly slower than using the original models, so this is not just a matter of measuring inference time correctly. Besides, when I run simple predictions with the two models together outside of my application, as mentioned above, only the prediction time is measured, with no pre- or post-processing included.

Hi,

We will get back to you on your queries.
If possible, please share a minimal repro script/model so we can try it on our end for better debugging.

Thank you.