TensorRT inference time occasionally increases dramatically after warmup


Hi, I was trying to use TensorRT 8.2 (via the OnnxRuntime TensorRT execution provider) to test whether the inference speed of a transformer model (fine-tuned from Pegasus) can be improved compared to using the CUDA execution provider alone.

My task is summarization. I choose the text with the largest number of tokens for the warmup, then run inference on about 2,500 paragraphs sequentially in a Python script.
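For context, my session setup looks roughly like this (simplified sketch, not my exact script; the model path is a placeholder). Listing the TensorRT provider first makes ONNX Runtime prefer it and fall back to CUDA/CPU for unsupported nodes:

```python
# Simplified sketch: requires onnxruntime-gpu built with TensorRT support.
# "model.onnx" is a placeholder path, not my real file.
try:
    import onnxruntime as ort
except ImportError:  # onnxruntime not installed in this environment
    ort = None

def make_session(model_path: str):
    # Provider order is the fallback order: TensorRT first, then CUDA, then CPU.
    providers = [
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]
    return ort.InferenceSession(model_path, providers=providers)
```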

Most of the time, the TensorRT execution provider does improve speed over the CUDA provider, but in a few runs the latency increases dramatically.

In my case, inputs are 30-300 tokens long. Most inferences take 100-200 ms (after the warmup), but for some inputs the latency reaches 400,000-500,000 ms (i.e., 400-500 seconds), which is an enormous increase.

Interestingly, when I rerun with the same execution order, the same sentences still produce this latency. In my latest test, the indices of the high-latency sentences were [1, 4, 15, 16, 80, 1652, 2160], so they don't only happen in the early runs. I recorded the number of input tokens for these sentences, and they are not particularly long; some are only 30-40 tokens.
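This is how I find those indices: I record per-input wall-clock latency and token count with plain timers and flag outliers. A minimal sketch (the `infer` callable and the 1-second threshold here are stand-ins, not my real model):

```python
import time

def find_latency_outliers(inputs, infer, threshold_s=1.0):
    """Run infer() over inputs sequentially; return (index, latency_s, n_tokens)
    for every input whose latency exceeds threshold_s."""
    outliers = []
    for i, tokens in enumerate(inputs):
        start = time.perf_counter()
        infer(tokens)
        latency = time.perf_counter() - start
        if latency > threshold_s:
            outliers.append((i, latency, len(tokens)))
    return outliers

# Stand-in inference that spikes only on the third call,
# mimicking a slow run on a short input.
calls = {"n": 0}
def fake_infer(tokens):
    calls["n"] += 1
    if calls["n"] == 3:
        time.sleep(1.2)  # simulate a latency spike

inputs = [[0] * 40, [0] * 120, [0] * 35, [0] * 300]
result = find_latency_outliers(inputs, fake_infer)
print(result)  # only index 2 (a 35-token input) is flagged
```

Recording the token count alongside the index is what shows the spikes are not correlated with input length.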

I wonder whether this occasional latency increase is expected, or whether something might be wrong with my setup. Thanks very much.

My setup is the TensorRT container 21.12-py3 + ONNX Runtime v1.10.0.


TensorRT Version: 8.2
GPU Type: V100
Baremetal or Container (if container which image + tag): nvidia tensorrt container 21.12-py3
ONNX Runtime Version: v1.10.0

Could you share the model, script, profiler, and performance output (if not already shared) so that we can help you better?
Alternatively, you can try running your model with the trtexec command.
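For example (the ONNX path and tensor name below are placeholders; adjust the shape ranges to your model's dynamic axes, here assumed to span your 30-300 token inputs at batch size 1):

```
trtexec --onnx=model.onnx \
        --minShapes=input_ids:1x30 \
        --optShapes=input_ids:1x128 \
        --maxShapes=input_ids:1x300 \
        --verbose
```

This reports per-iteration latency outside of ONNX Runtime, which helps isolate whether the spikes come from TensorRT itself.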

While measuring model performance, make sure you consider the latency and throughput of the network inference alone, excluding the data pre- and post-processing overhead.
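A minimal sketch of that separation, with stand-in preprocess/infer/postprocess callables (all hypothetical names, not a specific API):

```python
import time

def timed_run(text, preprocess, infer, postprocess):
    """Time only the network inference, excluding pre/post-processing."""
    tokens = preprocess(text)            # not timed
    start = time.perf_counter()
    output = infer(tokens)               # timed: network inference only
    infer_s = time.perf_counter() - start
    return postprocess(output), infer_s  # not timed

# Stand-ins so the sketch is self-contained and runnable.
summary, infer_s = timed_run(
    "some paragraph",
    preprocess=lambda t: t.split(),
    infer=lambda toks: toks[:2],
    postprocess=lambda o: " ".join(o),
)
print(summary, infer_s)
```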
Please refer to the link below for more details: