First inference after a pause is always long

hnamletran · January 19, 2022, 4:11am

Description

Coming from this topic: https://forums.developer.nvidia.com/t/inference-time-becomes-longer-when-doing-non-continuous-fp16-or-int8-inference/184127

We run inference in a loop using ONNX on CUDA (the same happens with ONNX/TensorRT, TensorFlow/CUDA, or anything else built on top of CUDA). With no pause between inferences, the inference time is very stable at around 6-7 ms. However, if we put a pause between inferences, the inference time shoots up to a few hundred milliseconds.

The script, model, and sample image are attached.
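The attached script is not reproduced in this post. As a rough illustration of the kind of timing loop described above, here is a minimal sketch. The names `time_inference_loop`, `infer`, and `dummy_infer` are hypothetical; `infer` stands in for whatever call actually runs the model (for example an ONNX Runtime `session.run(...)`), so this is not the attached script itself.

```python
import time


def time_inference_loop(infer, n_iters=100, pause_s=0.0):
    """Call infer() n_iters times and return each call's latency in ms.

    `infer` is a hypothetical stand-in for the real inference call;
    it is an assumption, not taken from the attached script.
    """
    timings_ms = []
    for _ in range(n_iters):
        start = time.perf_counter()
        infer()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
        if pause_s > 0:
            # Idle gap between inferences, kept outside the timed region.
            time.sleep(pause_s)
    return timings_ms


def dummy_infer():
    # Placeholder workload so the sketch runs without a GPU or a model.
    sum(i * i for i in range(10_000))


if __name__ == "__main__":
    no_pause = time_inference_loop(dummy_infer, n_iters=5, pause_s=0.0)
    with_pause = time_inference_loop(dummy_infer, n_iters=5, pause_s=0.1)
    print([f"{t:.1f}" for t in no_pause])
    print([f"{t:.1f}" for t in with_pause])
```

The sleep sits outside the timed region, so any slowdown measured with `pause_s=0.1` comes from the inference call itself and not from the pause. `time.perf_counter()` is used here because it is a high-resolution clock; `time.time()` can be noticeably coarser on Windows.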

Output (with time.sleep(0.1) between inferences):

GPU
preprocess time: 0.0
[‘602.5’, ‘6.0’, ‘5.7’, ‘6.0’, ‘6.0’, ‘6.0’, ‘7.0’, ‘6.7’, ‘12.9’, ‘11.8’, ‘7.0’, ‘7.9’, ‘7.7’, ‘12.8’, ‘12.8’, ‘15.9’, ‘13.0’, ‘13.0’, ‘14.8’, ‘13.8’, ‘22.0’, ‘21.8’, ‘23.9’, ‘23.8’, ‘24.8’, ‘25.0’, ‘23.9’, ‘26.0’, ‘26.0’, ‘23.8’, ‘24.0’, ‘25.8’, ‘25.0’, ‘27.1’, ‘25.0’, ‘27.0’, ‘25.8’, ‘27.8’, ‘27.1’, ‘26.0’, ‘27.0’, ‘28.1’, ‘26.8’, ‘24.8’, ‘25.8’, ‘25.8’, ‘26.0’, ‘26.9’, ‘26.8’, ‘27.0’, ‘27.0’, ‘24.8’, ‘26.8’, ‘27.8’, ‘26.8’, ‘26.0’, ‘27.0’, ‘25.0’, ‘24.8’, ‘27.0’, ‘24.8’, ‘27.0’, ‘27.0’, ‘27.1’, ‘25.9’, ‘24.9’, ‘27.8’, ‘27.0’, ‘27.0’, ‘27.8’, ‘26.8’, ‘27.0’, ‘27.0’, ‘24.7’, ‘25.0’, ‘28.1’, ‘26.0’, ‘26.9’, ‘24.7’, ‘24.8’, ‘25.0’, ‘26.8’, ‘27.0’, ‘27.0’, ‘26.0’, ‘29.0’, ‘25.0’, ‘25.0’, ‘24.7’, ‘28.1’, ‘28.0’, ‘27.0’, ‘26.0’, ‘25.0’, ‘27.0’, ‘26.9’, ‘27.9’, ‘26.8’, ‘27.9’, ‘24.8’]
[Finished in 20.2s]

When I removed the time.sleep(0.1), the inference time became very short:

GPU
preprocess time: 1.0004043579101562
[‘618.3’, ‘6.0’, ‘5.0’, ‘10.0’, ‘6.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘6.0’, ‘6.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’, ‘5.0’, ‘5.0’, ‘5.0’, ‘6.0’, ‘5.0’]
[Finished in 7.6s]
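To compare the two runs numerically, one option is to separate the first (warm-up) call from the rest and take the median of the remainder. The sketch below does that on short excerpts only: the first value and the last ten values of each list above. The helper name `summarize` and the excerpt variable names are hypothetical.

```python
import statistics


def summarize(timings):
    """Return (first call in ms, median of the remaining calls in ms).

    `timings` is a list of strings as printed by the script above.
    """
    values = [float(t) for t in timings]
    first, rest = values[0], values[1:]
    return first, statistics.median(rest)


# Excerpts copied from the outputs above: first value + last ten values.
with_pause_excerpt = ['602.5', '28.0', '27.0', '26.0', '25.0', '27.0',
                      '26.9', '27.9', '26.8', '27.9', '24.8']
no_pause_excerpt = ['618.3', '6.0', '5.0', '5.0', '6.0', '5.0',
                    '5.0', '5.0', '5.0', '6.0', '5.0']

for label, excerpt in [("with pause", with_pause_excerpt),
                       ("no pause", no_pause_excerpt)]:
    first, median_rest = summarize(excerpt)
    print(f"{label}: first = {first:.1f} ms, median of rest = {median_rest:.2f} ms")
```

On these excerpts the first call costs about 600 ms in both runs, while the median of the later calls is roughly 27 ms with the pause and 5 ms without it.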

So my question is why the inference time suddenly becomes slower when there’s a pause in between, and what can we do to prevent this?

Thank you!

Environment

TensorRT Version: We don't use TensorRT, but ONNX on CUDA
GPU Type: NVIDIA 2080Ti
Nvidia Driver Version:
CUDA Version: 11.5
CUDNN Version:
Operating System + Version: Windows 10
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

test_trt_short.py (1.8 KB)
model.onnx (10.3 MB)
defective_sample_0001

NVES · January 19, 2022, 4:38am

Hi,
Could you try running your model with the trtexec command and share the "--verbose" log if the issue persists?
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

You can refer to the link below for the list of supported operators. If an operator is not supported, you need to create a custom plugin to support that operation.

https://github.com/onnx/onnx-tensorrt/blob/main/docs/operators.md

Also, please share your model and script, if you have not already, so that we can help you better.

Meanwhile, for some common errors and queries please refer to below link:

Thanks!

spolisetty · January 25, 2022, 9:22am

Hi,

Based on the script you've shared, it looks like this is related to onnxruntime and not to TensorRT.
If so, we recommend posting your concern on the microsoft/onnxruntime GitHub Issues page to get better help.

Thank you.

liurunze · June 30, 2022, 2:25am

Hi,
I ran into the same problem while using the TensorRT C++ SDK, not onnxruntime, so I'm pretty sure it is related to TensorRT.
With a pause between inferences, the inference time becomes much slower.

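// Needs <chrono>, <iostream>, <thread>, plus using namespace std and std::chrono.
// `context` (the TensorRT execution context) and `bindings` are set up elsewhere.
for (int i = 0; i < 10000; i++)
{
    auto t1 = high_resolution_clock::now();
    context->executeV2(bindings);  // synchronous inference
    // Asynchronous alternative:
    // context->enqueueV2(bindings, stream, nullptr);
    // cudaStreamSynchronize(stream);
    auto t2 = high_resolution_clock::now();
    cout << i << " time = " << duration_cast<microseconds>(t2 - t1).count() / 1000.0 << " ms" << endl;
    this_thread::sleep_for(milliseconds(100));  // 100 ms pause between inferences
}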

Thanks!

user66817 · August 4, 2022, 3:07am

I'm getting the same issue with a HiFi-GAN vocoder ONNX FP16 model and have been unable to find the cause. Has anyone found a fix or solution for this?

hnamletran · August 4, 2022, 3:26am

Sorry guys, no clue yet.
