I use TensorRT to measure the latency of some models and found that the results for big models (more parameters) fluctuate back and forth, while a small model's latency is stable. Is this a normal phenomenon, or am I using TensorRT in the wrong way?
Also, a model's latency seems to be stable if it is less than 5ms (ignoring the time for copying data to GPU memory).
Environment
**TensorRT Version:** 6.0.1.5
**GPU Type:** NVIDIA T4
**Nvidia Driver Version:** 440.33.01
**CUDA Version:** 10.2
**CUDNN Version:** 7.6.5
**Operating System + Version:** 7.6.5
**Python Version (if applicable):** 3.6.4
**Persistence-M:** ON
**Volatile Uncorr. ECC:** OFF
code
```python
import time
import pycuda.driver as cuda

# h_input, d_input, and the execution context are set up elsewhere

# copy input data to GPU memory in a synchronous way
cuda.memcpy_htod(d_input, h_input)

# run inference in a synchronous way and time it
s_time = time.time()
context.execute(batch_size, bindings)
infer_time = time.time() - s_time
```
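Timing a single synchronous `execute()` call like this is inherently noisy: the first runs after engine creation are slower, and OS scheduling adds jitter that shows up more on longer-running (bigger) models. One common way to get a more stable number is to discard a few warm-up runs and then report statistics over many iterations. Below is a minimal sketch of such a harness; `measure_latency` and the stand-in workload are illustrative names, not TensorRT API — in real use the callable would wrap `context.execute(...)`.

```python
import time
import statistics

def measure_latency(run_once, warmup=10, iters=100):
    """Time a synchronous callable: discard warm-up runs, then
    collect per-iteration latencies in milliseconds."""
    for _ in range(warmup):
        run_once()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        # approximate 99th-percentile latency from the sorted samples
        "p99_ms": sorted(samples)[int(0.99 * len(samples)) - 1],
    }

# stand-in CPU workload; replace with your inference call
stats = measure_latency(lambda: sum(i * i for i in range(10000)))
print(stats)
```

Reporting the median or a percentile rather than a single measurement makes big-model and small-model numbers much more comparable.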
Fluctuation in latency is expected. It depends on what GPU memory and other resources are available at run time. Please make sure the same amount of GPU memory is available for every run.
Also, it looks like you're using a very old version of TensorRT. We recommend trying the latest TensorRT version. If you still face this issue, please share an ONNX model and scripts that reproduce it.