Description
Coming from this topic: https://forums.developer.nvidia.com/t/inference-time-becomes-longer-when-doing-non-continuous-fp16-or-int8-inference/184127
We run inference in a loop using an ONNX model on CUDA (the same happens with ONNX/TensorRT, TensorFlow/CUDA, or anything else on top of CUDA). If there is no pause between inferences, the inference time is very stable at around 6-7 ms. However, if we put a pause between inferences, the inference time shoots up to a few hundred milliseconds.
The script, model, and sample image are attached.
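For reference, the timing loop looks roughly like the sketch below (a minimal sketch assuming ONNX Runtime's Python API with the CUDAExecutionProvider; the input shape is a placeholder, and the actual code is in the attached test_trt_short.py):

import time
import numpy as np
import onnxruntime as ort

# Assumed setup: ONNX Runtime with the CUDA execution provider.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_name = sess.get_inputs()[0].name

# Dummy input; the real script preprocesses the attached sample image.
# The 1x3x224x224 shape is only a placeholder.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

timings = []
for _ in range(100):
    t0 = time.time()
    sess.run(None, {input_name: x})                          # one inference
    timings.append("%.1f" % ((time.time() - t0) * 1000.0))   # per-inference time in ms
    time.sleep(0.1)                                          # the pause in question

print(timings)

Removing the time.sleep(0.1) line from this loop is the only change between the two runs shown below.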
output:
GPU
preprocess time: 0.0
['602.5', '6.0', '5.7', '6.0', '6.0', '6.0', '7.0', '6.7', '12.9', '11.8', '7.0', '7.9', '7.7', '12.8', '12.8', '15.9', '13.0', '13.0', '14.8', '13.8', '22.0', '21.8', '23.9', '23.8', '24.8', '25.0', '23.9', '26.0', '26.0', '23.8', '24.0', '25.8', '25.0', '27.1', '25.0', '27.0', '25.8', '27.8', '27.1', '26.0', '27.0', '28.1', '26.8', '24.8', '25.8', '25.8', '26.0', '26.9', '26.8', '27.0', '27.0', '24.8', '26.8', '27.8', '26.8', '26.0', '27.0', '25.0', '24.8', '27.0', '24.8', '27.0', '27.0', '27.1', '25.9', '24.9', '27.8', '27.0', '27.0', '27.8', '26.8', '27.0', '27.0', '24.7', '25.0', '28.1', '26.0', '26.9', '24.7', '24.8', '25.0', '26.8', '27.0', '27.0', '26.0', '29.0', '25.0', '25.0', '24.7', '28.1', '28.0', '27.0', '26.0', '25.0', '27.0', '26.9', '27.9', '26.8', '27.9', '24.8']
[Finished in 20.2s]
When I removed the time.sleep(0.1), the inference time became very short:
GPU
preprocess time: 1.0004043579101562
['618.3', '6.0', '5.0', '10.0', '6.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '6.0', '5.0', '6.0', '5.0', '6.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '5.0', '6.0', '5.0', '6.0', '5.0', '6.0', '6.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '5.0', '5.0', '6.0', '5.0']
[Finished in 7.6s]
So my question is: why does the inference time become slower when there is a pause between inferences, and what can we do to prevent this?
Thank you!
Environment
TensorRT Version: Not used; we run ONNX on CUDA
GPU Type: NVIDIA 2080Ti
Nvidia Driver Version:
CUDA Version: 11.5
CUDNN Version:
Operating System + Version: Windows 10
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
Relevant Files
test_trt_short.py (1.8 KB)
model.onnx (10.3 MB)
Steps To Reproduce
- Place test_trt_short.py, model.onnx, and the sample image in the same directory and run the script; it runs the model in a loop with a time.sleep(0.1) pause after each inference and prints the per-inference times.
- Remove the time.sleep(0.1) line and run the script again to compare the timings.
- No errors are raised; the issue is the large difference in per-inference time between the two runs.