TensorRT inference time fluctuates when testing a big model


I use TensorRT to measure the latency of some models and found that the results for a big model (more parameters) fluctuate back and forth, while a small model's latency is stable. Is this a normal phenomenon, or am I using TensorRT in the wrong way?

And a model's latency is stable as long as it is below 5 ms (ignoring the time for copying data to GPU memory).


TensorRT Version:
Nvidia Driver Version: 440.33.01
CUDA Version: 10.2
CUDNN Version: 7.6.5
Operating System + Version:
Python Version (if applicable): 3.6.4
Persistence-M: ON
Volatile Uncorr. ECC: OFF


# copy data to GPU memory synchronously
s_time = time.time()

# run inference synchronously
infer_time = time.time() - s_time
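For what it's worth, a more robust way to time a synchronous call is to discard warm-up iterations and report statistics over many runs, since the first few launches include one-off costs (context creation, kernel autotuning). The sketch below is a generic timing harness under that assumption; the lambda is a dummy workload standing in for your actual TensorRT execute call (e.g. `context.execute_v2(...)`), not a real TensorRT API:

```python
import statistics
import time

def measure_latency(fn, warmup=10, iters=100):
    """Time fn() in milliseconds, excluding warm-up runs; return (mean, stdev)."""
    for _ in range(warmup):              # discard one-off startup costs
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()      # monotonic, higher resolution than time.time()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)

# Dummy workload in place of a real inference call
mean_ms, stdev_ms = measure_latency(lambda: sum(range(10000)))
print(f"mean={mean_ms:.3f} ms, stdev={stdev_ms:.3f} ms")
```

Comparing the standard deviation across models makes the "fluctuation" quantifiable instead of eyeballed.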

Could you share the model, script, profiler, and performance output (if not shared already) so that we can help you better?
Alternatively, you can try running your model with the trtexec command.
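For reference, a typical trtexec invocation looks like the following; `model.onnx` is a placeholder for your model file, and the warm-up and averaging flags control the measurement window so one-off startup costs don't skew the numbers:

```shell
# Build an engine from the ONNX model and report latency statistics
trtexec --onnx=model.onnx --warmUp=500 --iterations=100 --avgRuns=10
```

trtexec prints min/max/mean latency, which makes it easy to compare the jitter of the big and small models under identical conditions.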

While measuring model performance, make sure you consider the latency and throughput of the network inference itself, excluding the data pre- and post-processing overhead.
Please refer to the link below for more details:


Hi @928024300,

Fluctuation in latency is expected; it depends on how much memory and other resources are available at the time. Please make sure the same amount of GPU memory is available on every run.
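One way to watch for competing memory usage and clock throttling (a common source of latency jitter on longer-running, larger models) is to poll nvidia-smi while the benchmark runs; these are standard query fields, not specific to this setup:

```shell
# Sample memory usage, SM clock, and temperature once per second
nvidia-smi --query-gpu=memory.used,clocks.sm,temperature.gpu --format=csv -l 1
```

If the SM clock drops or memory usage varies between runs, the latency spread on the big model is likely environmental rather than a TensorRT bug.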

Also, it looks like you're using a very old version of TensorRT. We recommend trying the latest TensorRT version. If you still face this issue, please share a reproducible ONNX model and the scripts.

Thank you.