First inference after a pause is always long

Description

Coming from this topic: https://forums.developer.nvidia.com/t/inference-time-becomes-longer-when-doing-non-continuous-fp16-or-int8-inference/184127

We run inference in a loop using ONNX Runtime on CUDA (the same happens with ONNX/TensorRT, TensorFlow/CUDA, or anything else on top of CUDA). With no pause between inferences, the inference time is very stable at around 6-7 ms. However, with a pause between inferences, the first inference takes several hundred milliseconds and the subsequent ones settle around 25-28 ms (see the output below).
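
A minimal sketch of the measurement loop (the actual script is attached below; this assumes onnxruntime's Python API with the CUDA execution provider, and the input name and shape here are placeholders):

    import time
    import numpy as np
    import onnxruntime as ort

    # Assumed setup: model.onnx is the attached model; input shape is a placeholder.
    session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
    input_name = session.get_inputs()[0].name
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder input

    times = []
    for _ in range(100):
        t0 = time.perf_counter()
        session.run(None, {input_name: x})  # one inference call
        times.append("%.1f" % ((time.perf_counter() - t0) * 1000.0))
        time.sleep(0.1)  # the pause; removing this line restores the fast, stable timings
    print(times)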

The script, model, and sample image are attached.

Output with time.sleep(0.1) between inferences:

GPU
preprocess time: 0.0
['602.5', '6.0', '5.7', '6.0', '6.0', '6.0', '7.0', '6.7', '12.9', '11.8', '7.0', '7.9', '7.7', '12.8', '12.8', '15.9', '13.0', '13.0', '14.8', '13.8', '22.0', '21.8', '23.9', '23.8', '24.8', '25.0', '23.9', '26.0', '26.0', '23.8', '24.0', '25.8', '25.0', '27.1', '25.0', '27.0', '25.8', '27.8', '27.1', '26.0', '27.0', '28.1', '26.8', '24.8', '25.8', '25.8', '26.0', '26.9', '26.8', '27.0', '27.0', '24.8', '26.8', '27.8', '26.8', '26.0', '27.0', '25.0', '24.8', '27.0', '24.8', '27.0', '27.0', '27.1', '25.9', '24.9', '27.8', '27.0', '27.0', '27.8', '26.8', '27.0', '27.0', '24.7', '25.0', '28.1', '26.0', '26.9', '24.7', '24.8', '25.0', '26.8', '27.0', '27.0', '26.0', '29.0', '25.0', '25.0', '24.7', '28.1', '28.0', '27.0', '26.0', '25.0', '27.0', '26.9', '27.9', '26.8', '27.9', '24.8']
[Finished in 20.2s]

When I removed the time.sleep(0.1), the inference time became very short:

GPU
preprocess time: 1.0004043579101562
['618.3', '6.0', '5.0', '10.0', '6.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '6.0', '5.0', '6.0', '5.0', '6.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '5.0', '6.0', '5.0', '6.0', '5.0', '6.0', '6.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '6.0', '5.0', '5.0', '5.0', '5.0', '6.0', '5.0']
[Finished in 7.6s]

So my question is: why does the inference time suddenly become slower when there is a pause in between, and what can we do to prevent it?

Thank you!

Environment

TensorRT Version: We don't use TensorRT, but ONNX Runtime on CUDA
GPU Type: NVIDIA GeForce RTX 2080 Ti
Nvidia Driver Version:
CUDA Version: 11.5
CUDNN Version:
Operating System + Version: Windows 10
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files


test_trt_short.py (1.8 KB)
model.onnx (10.3 MB)
defective_sample_0001


Hi,
Can you try running your model with the trtexec command, and share the "--verbose" log in case the issue persists?
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
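
For example, with the model attached above (--onnx and --verbose are standard trtexec flags):

    trtexec --onnx=model.onnx --verbose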

You can refer to the link below for the list of supported operators; if any operator is not supported, you need to create a custom plugin for that operation.

Also, we request you to share your model and script, if not already shared, so that we can help you better.

Meanwhile, for some common errors and queries, please refer to the link below:

Thanks!

Hi,

Based on the script you've shared, it looks like this is related to onnxruntime rather than TensorRT.
If so, we recommend posting your concern on Issues · microsoft/onnxruntime · GitHub to get better help.

Thank you.

Hi,
I've met the same problem, and I'm using the TensorRT C++ SDK, not onnxruntime, so I'm pretty sure it is related to TensorRT.
When I put a pause between inferences, the inference time becomes much slower.

// Assumes a deserialized TensorRT engine: context is an IExecutionContext*,
// bindings is the array of device buffer pointers, and using-declarations for
// std and std::chrono are in effect.
for (int i = 0; i < 10000; i++)
{
    auto t1 = high_resolution_clock::now();
    context->executeV2(bindings);                     // synchronous inference
    // context->enqueueV2(bindings, stream, nullptr); // asynchronous variant
    // cudaStreamSynchronize(stream);
    auto t2 = high_resolution_clock::now();
    cout << i << " time = " << duration_cast<microseconds>(t2 - t1).count() / 1000.0 << " ms" << endl;
    _sleep(100); // 100 ms pause (MSVC-specific); removing it restores the fast timings
}
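
(The commented-out lines are the asynchronous path: enqueueV2 launches inference on a CUDA stream and cudaStreamSynchronize blocks until it finishes, so the measured wall time is comparable either way.)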

Thanks!

Getting the same issue with a HiFi-GAN vocoder ONNX FP16 model. Unable to find out what causes it. Has anyone found a fix or solution for this?

Sorry guys, no clue yet.