I converted two models to TensorRT using TF-TRT, in both FP32 and FP16 precision, and I see a good speedup in inference time.
That said, I have two problems:
- The first inference takes a long time (30 s for one model, 90 s for the other), and that's too long for my application. Is this a known behavior in TensorRT?
The time is spent specifically in this line:
pred = infer(batch)['tf.math.sigmoid']
after loading the model with:
model = tf.saved_model.load(model_path, tags=[tag_constants.SERVING])
infer = model.signatures['serving_default']
Is it possible to serialize the model in a way that cuts this time, or does TRT still have to perform some optimizations (e.g. build its engines) before the first inference?
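For now I work around it with a warmup call right after loading, so the one-off cost does not land on the first real request. Below is a minimal, self-contained sketch of that idea; make_fake_infer and warmup are hypothetical stand-ins I wrote for illustration (a fake model simulating a slow first call), not the real TF-TRT objects:

```python
import time

def make_fake_infer(build_seconds=0.5, steady_seconds=0.01):
    """Hypothetical stand-in for a TF-TRT serving signature: the first
    call pays a one-off 'engine build' cost, later calls are fast."""
    state = {"built": False}
    def infer(batch):
        if not state["built"]:
            time.sleep(build_seconds)   # simulate one-off engine building
            state["built"] = True
        time.sleep(steady_seconds)      # simulate steady-state inference
        return {"tf.math.sigmoid": [0.5 for _ in batch]}
    return infer

def warmup(infer, dummy_batch, n=2):
    """Run a few throwaway inferences at startup so real requests are fast."""
    for _ in range(n):
        infer(dummy_batch)

infer = make_fake_infer()
t0 = time.perf_counter(); infer([0]); first = time.perf_counter() - t0
t0 = time.perf_counter(); infer([0]); later = time.perf_counter() - t0
print(first, later)  # the one-off cost lands on the very first call
```

This only hides the cost at startup rather than removing it, which is why I'm asking whether the optimized engines can be serialized up front instead.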
- When I run the two models together in the same loop (a prediction with one, then a prediction with the other), just to check whether using two models together runs slowly, I see very slow inference times for both models.
Some background: my application predicts on an image with the first model and then makes a few predictions on the first model's outputs with the second model. Doing that with two TF-TRT models results in a dramatic increase in inference time.
Any ideas on why this happens and how I should approach it (other than creating a new architecture that performs both stages in a single model)?
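To rule out a measurement artifact, this is roughly how I compare each model's average latency alone versus interleaved in one loop. The model factories and latencies below are hypothetical placeholders standing in for my real TF-TRT signatures:

```python
import time

def make_model(latency):
    """Hypothetical stand-in for a TF-TRT model's serving signature."""
    def predict(batch):
        time.sleep(latency)          # simulate per-call inference cost
        return [x * 2 for x in batch]
    return predict

model_a = make_model(0.005)          # stage 1: whole-image prediction
model_b = make_model(0.002)          # stage 2: runs on stage-1 outputs

def time_loop(fn, n=20):
    """Average wall-clock time per call over n iterations."""
    t0 = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - t0) / n

alone_a = time_loop(lambda: model_a([1]))
alone_b = time_loop(lambda: model_b([1]))
together = time_loop(lambda: model_b(model_a([1])))
print(alone_a, alone_b, together)
```

With these fake models the interleaved time is simply the sum of the two, but with the real TF-TRT models I measure far more than that, which is what I can't explain.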
TensorRT Version: 22.214.171.124
GPU Type: RTX 3060 (Laptop)
Nvidia Driver Version: 515
CUDA Version: 11.7 (per nvcc --version)
Operating System + Version: Ubuntu 20.04
Python Version (if applicable): 3.8.10
TensorFlow Version (if applicable): 2.9.1
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorflow:22.06-tf2-py3