Inference time becomes longer when doing non-continuous fp16 or int8 inference

Description

Inference time becomes longer when doing “non-continuous” fp16 or int8 inference.

  • “Non-continuous” inference means doing the pre-processing and cudaMemcpyHostToDevice for a different input each time before inference.
  • “Continuous” inference means doing the pre-processing and cudaMemcpyHostToDevice only once for the same input, and then running the inference N = 1000 times in a row (see the sketch after this list).
  • The issue cannot be reproduced when doing “non-continuous” fp32 inference.
  • The issue cannot be reproduced when doing “continuous” fp16 or int8 inference.
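To make the two terms concrete, here is a minimal sketch (simplified, not my actual code), assuming the HostDeviceMem-style buffers from the TensorRT Python samples; preprocess() is a placeholder for my pre-processing, and the output device-to-host copies are omitted:

```python
import pycuda.driver as cuda

def continuous(context, bindings, inputs, stream, frame, n=1000):
    # "Continuous": pre-process and copy host-to-device once, then run the
    # same input through the engine N = 1000 times back to back.
    inputs[0].host = preprocess(frame)  # preprocess() is a placeholder
    cuda.memcpy_htod_async(inputs[0].device, inputs[0].host, stream)
    stream.synchronize()
    for _ in range(n):
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        stream.synchronize()

def non_continuous(context, bindings, inputs, stream, frames):
    # "Non-continuous": pre-process and copy host-to-device for a different
    # input before every single inference.
    for frame in frames:
        inputs[0].host = preprocess(frame)  # preprocess() is a placeholder
        cuda.memcpy_htod_async(inputs[0].device, inputs[0].host, stream)
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        stream.synchronize()
```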

I think the issue is not model related.
Please let me know if you cannot reproduce the issue.

Environment

Xavier
TensorRT Version: 7.1.3-1
CUDA Version: 10.2
cuDNN Version: 8.0

Hi, please refer to the links below to perform inference in INT8

Thanks!

Thanks for the reply.
But I have no problem with the INT8 calibration.
This topic is about inference performance.
Not only INT8 but also FP16 shows the same issue.

I see there is a “warm-up” mechanism mentioned in some other topics.

Is that the reason the “non-continuous” inference slows down?
Would you please explain more about the “warm-up” mechanism? For example, under what conditions will TensorRT “warm up” again?

By the way, I use FP32 input for both the FP32 and the FP16/INT8 models. Please let me know if that is a problem.

@zhaofengming.zfm,

I believe they are referring to doing a few warm-up runs of common.do_inference_v2() before starting the timing. The very first run usually takes a long time setting things up.
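Something like the following rough sketch (the context, bindings, inputs, outputs and stream objects are assumed to come from the usual sample setup in common.py):

```python
import time
import common  # helper from the TensorRT Python samples

WARMUP_RUNS = 10   # arbitrary; a handful is usually enough
TIMED_RUNS = 1000

# Warm-up runs: the first call(s) pay one-time setup costs, so keep them
# out of the measurement.
for _ in range(WARMUP_RUNS):
    common.do_inference_v2(context, bindings=bindings, inputs=inputs,
                           outputs=outputs, stream=stream)

# Timed runs.
start = time.perf_counter()
for _ in range(TIMED_RUNS):
    common.do_inference_v2(context, bindings=bindings, inputs=inputs,
                           outputs=outputs, stream=stream)
elapsed = time.perf_counter() - start
print("average latency: %.3f ms" % (elapsed / TIMED_RUNS * 1000))
```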

Thank you.

@spolisetty

Thanks for the reply. I see what you mean.

And I found there is the same “warm-up” process in the trtexec module when doing time measurement; see the inferenceLoop function in /usr/src/tensorrt/samples/common/sampleInference.cpp.

I totally understand that, in order to obtain more accurate measurement results, it’s better to do a few warm-up runs before starting the timing, because “the very first run usually takes a long time setting things up.”

But I think these are all “continuous” inference scenarios, as I described in this topic. After a few warm-up runs I can get the best performance when doing “continuous” inference.

In the “non-continuous” inference scenarios (the gap between two inferences is 10 ms or 20 ms, depending on the pre-processing time for each input), I observed that the “very first run” / “warm-up” effect appears again and again.
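Roughly, the loop that shows this looks like the following sketch (not my actual code; the sleep stands in for the per-input pre-processing and host-to-device copy, and the buffers follow the TensorRT Python samples):

```python
import time
import common  # helper from the TensorRT Python samples

for i in range(100):
    time.sleep(0.015)  # ~15 ms gap, standing in for pre-processing + H2D copy
    start = time.perf_counter()
    common.do_inference_v2(context, bindings=bindings, inputs=inputs,
                           outputs=outputs, stream=stream)
    latency_ms = (time.perf_counter() - start) * 1000
    print("run %d: %.3f ms" % (i, latency_ms))  # stays high in my tests instead of settling
```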

I also observed that running two inferences in parallel reduces the “very first run” / “warm-up” effect. It seems that the two models use the GPU/TensorRT alternately, so the GPU/TensorRT doesn’t “cool down” and thus doesn’t need to “warm up” again.

So would you please let me know under what conditions (for example, if the GPU/TensorRT has been idle for more than xx ms) TensorRT or the GPU will “warm up” again?