TensorRT inference time occasionally increases dramatically after warmup


Hi, I was trying to use TensorRT 8.2 (via the OnnxRuntime TensorRT execution provider) to test whether the inference speed of a transformer model (fine-tuned from Pegasus) can be improved compared to using the CUDA execution provider alone.

My task is summarization. I choose the text with the largest number of tokens for the warmup, then run inference on about 2,500 paragraphs sequentially in a Python script.
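For context, my session setup looks roughly like this (simplified sketch, not my exact script; the model path is a placeholder). Listing the TensorRT provider first makes ONNX Runtime prefer it and fall back to CUDA/CPU for unsupported nodes:

```python
# Simplified sketch: requires onnxruntime-gpu built with TensorRT support.
# "model.onnx" is a placeholder path, not my real file.
try:
    import onnxruntime as ort
except ImportError:  # onnxruntime not installed in this environment
    ort = None

def make_session(model_path: str):
    # Provider order is the fallback order: TensorRT first, then CUDA, then CPU.
    providers = [
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]
    return ort.InferenceSession(model_path, providers=providers)
```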

Most of the time, the TensorRT execution provider does improve speed over the CUDA provider, but in a few runs the latency increases dramatically.

In my case, inputs are 30-300 tokens long. Most inferences take 100-200 ms (after the warmup), but for some inputs the latency reaches 400,000-500,000 ms (i.e., 400-500 seconds), which is an enormous increase.

Interestingly, when I rerun with the same execution order, the same sentences still produce this latency. In my latest test, the indices of the high-latency sentences were [1, 4, 15, 16, 80, 1652, 2160], so they don't only happen in the early runs. I recorded the number of input tokens for these sentences, and they are not particularly long; some are only 30-40 tokens.
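This is how I find those indices: I record per-input wall-clock latency and token count with plain timers and flag outliers. A minimal sketch (the `infer` callable and the 1-second threshold here are stand-ins, not my real model):

```python
import time

def find_latency_outliers(inputs, infer, threshold_s=1.0):
    """Run infer() over inputs sequentially; return (index, latency_s, n_tokens)
    for every input whose latency exceeds threshold_s."""
    outliers = []
    for i, tokens in enumerate(inputs):
        start = time.perf_counter()
        infer(tokens)
        latency = time.perf_counter() - start
        if latency > threshold_s:
            outliers.append((i, latency, len(tokens)))
    return outliers

# Stand-in inference that spikes only on the third call,
# mimicking a slow run on a short input.
calls = {"n": 0}
def fake_infer(tokens):
    calls["n"] += 1
    if calls["n"] == 3:
        time.sleep(1.2)  # simulate a latency spike

inputs = [[0] * 40, [0] * 120, [0] * 35, [0] * 300]
result = find_latency_outliers(inputs, fake_infer)
print(result)  # only index 2 (a 35-token input) is flagged
```

Recording the token count alongside the index is what shows the spikes are not correlated with input length.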

I wonder whether this occasional latency increase is expected, or whether something might be wrong with my setup. Thanks very much.

My setup is the TensorRT container 21.12-py3 + ONNX Runtime v1.10.0.


TensorRT Version: 8.2
GPU Type: V100
Baremetal or Container (if container which image + tag): nvidia tensorrt container 21.12-py3
ONNX Runtime Version: v1.10.0

Could you share the model, script, profiler, and performance output (if not already shared) so that we can help you better?
Alternatively, you can try running your model with the trtexec command.
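For example (the ONNX path and tensor name below are placeholders; adjust the shape ranges to your model's dynamic axes, here assumed to span your 30-300 token inputs at batch size 1):

```
trtexec --onnx=model.onnx \
        --minShapes=input_ids:1x30 \
        --optShapes=input_ids:1x128 \
        --maxShapes=input_ids:1x300 \
        --verbose
```

This reports per-iteration latency outside of ONNX Runtime, which helps isolate whether the spikes come from TensorRT itself.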

While measuring model performance, make sure you consider the latency and throughput of the network inference alone, excluding the data pre- and post-processing overhead.
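A minimal sketch of that separation, with stand-in preprocess/infer/postprocess callables (all hypothetical names, not a specific API):

```python
import time

def timed_run(text, preprocess, infer, postprocess):
    """Time only the network inference, excluding pre/post-processing."""
    tokens = preprocess(text)            # not timed
    start = time.perf_counter()
    output = infer(tokens)               # timed: network inference only
    infer_s = time.perf_counter() - start
    return postprocess(output), infer_s  # not timed

# Stand-ins so the sketch is self-contained and runnable.
summary, infer_s = timed_run(
    "some paragraph",
    preprocess=lambda t: t.split(),
    infer=lambda toks: toks[:2],
    postprocess=lambda o: " ".join(o),
)
print(summary, infer_s)
```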
Please refer to the link below for more details: