TensorRT inference time occasionally increases dramatically after warmup


Hi, I was trying to use TensorRT 8.2 (via the OnnxRuntime TensorRT execution provider) to test whether the inference speed of a transformer model (fine-tuned from Pegasus) can be improved compared to using the CUDA execution provider alone.

My task is summarization. I choose the text with the largest number of tokens for the warmup, then run inference on about 2,500 paragraphs sequentially in a Python script.
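For context, my session setup looks roughly like this (simplified sketch, not my exact script; the model path is a placeholder). Listing the TensorRT provider first makes ONNX Runtime prefer it and fall back to CUDA/CPU for unsupported nodes:

```python
# Simplified sketch: requires onnxruntime-gpu built with TensorRT support.
# "model.onnx" is a placeholder path, not my real file.
try:
    import onnxruntime as ort
except ImportError:  # onnxruntime not installed in this environment
    ort = None

def make_session(model_path: str):
    # Provider order is the fallback order: TensorRT first, then CUDA, then CPU.
    providers = [
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]
    return ort.InferenceSession(model_path, providers=providers)
```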

Most of the time, the TensorRT execution provider does improve speed over the CUDA provider, but in a few runs the latency increases dramatically.

In my case, inputs are 30-300 tokens long. Most inferences take 100-200 ms (after the warmup), but for some inputs the latency reaches 400,000-500,000 ms (i.e., 400-500 seconds), which is an enormous increase.

Interestingly, when I rerun with the same execution order, the same sentences still produce this latency. In my latest test, the indices of the high-latency sentences were [1, 4, 15, 16, 80, 1652, 2160], so they don't only happen in the early runs. I recorded the number of input tokens for these sentences, and they are not particularly long; some are only 30-40 tokens.
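This is how I find those indices: I record per-input wall-clock latency and token count with plain timers and flag outliers. A minimal sketch (the `infer` callable and the 1-second threshold here are stand-ins, not my real model):

```python
import time

def find_latency_outliers(inputs, infer, threshold_s=1.0):
    """Run infer() over inputs sequentially; return (index, latency_s, n_tokens)
    for every input whose latency exceeds threshold_s."""
    outliers = []
    for i, tokens in enumerate(inputs):
        start = time.perf_counter()
        infer(tokens)
        latency = time.perf_counter() - start
        if latency > threshold_s:
            outliers.append((i, latency, len(tokens)))
    return outliers

# Stand-in inference that spikes only on the third call,
# mimicking a slow run on a short input.
calls = {"n": 0}
def fake_infer(tokens):
    calls["n"] += 1
    if calls["n"] == 3:
        time.sleep(1.2)  # simulate a latency spike

inputs = [[0] * 40, [0] * 120, [0] * 35, [0] * 300]
result = find_latency_outliers(inputs, fake_infer)
print(result)  # only index 2 (a 35-token input) is flagged
```

Recording the token count alongside the index is what shows the spikes are not correlated with input length.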

I wonder whether this occasional latency increase is expected, or whether something might be wrong with my setup. Thanks very much.

My setup is the TensorRT container 21.12-py3 + ONNX Runtime v1.10.0.


TensorRT Version: 8.2
GPU Type: V100
Baremetal or Container (if container which image + tag): nvidia tensorrt container 21.12-py3
ONNX Runtime Version: v1.10.0

Could you share the model, script, profiler, and performance output (if not already shared) so that we can help you better?
Alternatively, you can try running your model with the trtexec command.
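For example (the ONNX path and tensor name below are placeholders; adjust the shape ranges to your model's dynamic axes, here assumed to span your 30-300 token inputs at batch size 1):

```
trtexec --onnx=model.onnx \
        --minShapes=input_ids:1x30 \
        --optShapes=input_ids:1x128 \
        --maxShapes=input_ids:1x300 \
        --verbose
```

This reports per-iteration latency outside of ONNX Runtime, which helps isolate whether the spikes come from TensorRT itself.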

While measuring model performance, make sure you consider the latency and throughput of the network inference alone, excluding the data pre- and post-processing overhead.
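A minimal sketch of that separation, with stand-in preprocess/infer/postprocess callables (all hypothetical names, not a specific API):

```python
import time

def timed_run(text, preprocess, infer, postprocess):
    """Time only the network inference, excluding pre/post-processing."""
    tokens = preprocess(text)            # not timed
    start = time.perf_counter()
    output = infer(tokens)               # timed: network inference only
    infer_s = time.perf_counter() - start
    return postprocess(output), infer_s  # not timed

# Stand-ins so the sketch is self-contained and runnable.
summary, infer_s = timed_run(
    "some paragraph",
    preprocess=lambda t: t.split(),
    infer=lambda toks: toks[:2],
    postprocess=lambda o: " ".join(o),
)
print(summary, infer_s)
```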
Please refer to the link below for more details: