Inference time increases when doing non-continuous FP16 or INT8 inference

Description

Inference time increases when doing “non-continuous” FP16 or INT8 inference.

  • “Non-continuous” inference means doing pre-processing and cudaMemcpyHostToDevice for a different input before every inference (see the sketch after this list).
  • “Continuous” inference means doing pre-processing and cudaMemcpyHostToDevice only once, for the same input, and then running inference continuously N=1000 times.
  • The issue cannot be reproduced with “non-continuous” FP32 inference.
  • The issue cannot be reproduced with “continuous” FP16 or INT8 inference.
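
For clarity, here is a minimal sketch of the two measurement patterns (Python; preprocess(), htod_copy(), and infer() are hypothetical stubs standing in for the real pre-processing, cudaMemcpyHostToDevice, and TensorRT execute calls):

    import time

    def preprocess(x):      # hypothetical stub; simulates the 10-20 ms pre-process
        time.sleep(0.015)
        return x

    def htod_copy(x):       # hypothetical stub for cudaMemcpyHostToDevice
        pass

    def infer():            # hypothetical stub for the real TensorRT execute call
        pass

    N = 1000
    images = [object()] * N

    # "continuous": pre-process and copy the same input once, then infer N times
    inp = preprocess(images[0])
    htod_copy(inp)
    times = []
    for _ in range(N):
        t0 = time.perf_counter()
        infer()
        times.append(time.perf_counter() - t0)
    print("continuous avg: %.2f ms" % (1e3 * sum(times) / N))

    # "non-continuous": pre-process and copy a different input before every inference
    times = []
    for image in images:
        inp = preprocess(image)   # ~10-20 ms idle gap on the GPU before each run
        htod_copy(inp)
        t0 = time.perf_counter()
        infer()
        times.append(time.perf_counter() - t0)
    print("non-continuous avg: %.2f ms" % (1e3 * sum(times) / N))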

I don’t think the issue is model-specific.
Please let me know if you cannot reproduce it.

Environment

Xavier
TensorRT Version: 7.1.3-1
CUDA Version: 10.2
cuDNN Version: 8.0

Hi, please refer to the links below on how to perform inference in INT8.

Thanks!

Thanks for the reply, but I have no problem with INT8 calibration.
This topic is about inference performance: not only INT8 but also FP16 shows the same issue.

I see a “warm-up” mechanism mentioned in some other topics.

Is that what slows down the “non-continuous” inference?
Would you please explain more about the “warm-up” mechanism? For example, under what conditions will TensorRT need to “warm up” again?

By the way, I use FP32 input for both the FP32 and the FP16/INT8 models. Please let me know if that is a problem.

@zhaofengming.zfm,

I believe they are referring to doing a few warm-up runs of common.do_inference_v2() before starting the timing. The very first run usually takes a long time setting things up.
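
For example (a minimal sketch, assuming the common.py helper module shipped with the TensorRT Python samples, and that context, bindings, inputs, outputs, and stream have already been set up the usual way):

    import time
    import common  # helper module from the TensorRT Python samples

    # a few warm-up runs whose results and timings are discarded
    for _ in range(10):
        common.do_inference_v2(context, bindings=bindings, inputs=inputs,
                               outputs=outputs, stream=stream)

    # only now start the actual timing
    t0 = time.perf_counter()
    common.do_inference_v2(context, bindings=bindings, inputs=inputs,
                           outputs=outputs, stream=stream)
    print("timed run: %.2f ms" % ((time.perf_counter() - t0) * 1e3))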

Thank you.

@spolisetty

Thanks for the reply. I see what you mean.

I also found the same “warm-up” process in trtexec when it measures time; see the inferenceLoop function in /usr/src/tensorrt/samples/common/sampleInference.cpp.
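
For reference, trtexec also exposes this as a command-line option: --warmUp=N runs the engine for N milliseconds before the timed iterations begin (flag names as in the TensorRT 7.x trtexec; model.engine below is a placeholder):

    ./trtexec --loadEngine=model.engine --warmUp=500 --iterations=1000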

I totally understand that, in order to obtain more accurate measurements, it is better to do a few warm-up runs before starting the timing, because “the very first run usually takes a long time setting things up.”

But I think these are all “continuous” inference scenarios as I described in this topic. After a few warm-up runs I do get the best performance when doing “continuous” inference.

In “non-continuous” inference scenarios (the gap between two inferences is 10 ms or 20 ms, depending on the pre-processing time for each input), I observed the “very first run”/“warm-up” effect appearing again and again.
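
To quantify this, one could sweep the idle gap between two inferences and record how much the run after the gap slows down (a minimal sketch; infer() is a hypothetical stub for the real TensorRT execute call):

    import time

    def infer():                     # hypothetical stub for the real TensorRT call
        pass

    for gap_ms in (0, 5, 10, 20, 50, 100):
        infer()                      # start from a warmed-up state
        time.sleep(gap_ms / 1000.0)  # simulated idle gap (e.g. pre-processing time)
        t0 = time.perf_counter()
        infer()                      # time the run that follows the idle gap
        dt_ms = (time.perf_counter() - t0) * 1e3
        print("gap %3d ms -> inference %.2f ms" % (gap_ms, dt_ms))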

I also observed that running two inferences in parallel reduces the “very first run”/“warm-up” effect. It seems the two models use the GPU/TensorRT alternately, so the GPU/TensorRT never “cools down” and therefore never needs to “warm up” again.

So would you please let me know under what conditions (for example, after the GPU/TensorRT has been idle for more than xx ms) TensorRT or the GPU will need to “warm up” again?

Hi @zhaofengming.zfm,

Sorry for the delayed response. It’s hard to provide suggestions based on this information; we recommend you share a minimal repro script/model for better debugging.

Thank you.

Hi @spolisetty

profiler.7z (1.9 KB)

Please check the attachment.

I cannot upload the model file due to the size limitation.
Get it from here: GitHub - ultralytics/yolov5: YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite

I don’t think the issue is model-specific; you can use any model to reproduce it.

yolov5.diff (447 Bytes)

I ran into some issues when using the original script.
Apply the attached patch before executing the following script:
python3 export.py --weights yolov5s.pt

Hi,

Have you observed the same issue when using trtexec? If yes, could you share both verbose logs?

No.
trtexec does “continuous” inference and additionally skips several initial frames, so I haven’t seen any delay effect there.

Could you please share the minimal repro script you’re using?

Thank you.

It was shared on Aug 3; please check the reply above.

Sorry, we couldn’t find the inference script you’re using in the 7z file you shared. Please provide the inference script with a sample so we can reproduce the issue.

Thank you.

profiler.7z.001 (10 MB)

profiler.7z.002 (10 MB)

profiler.7z.003 (10 MB)

profiler.7z.004 (4.7 MB)

Hi @spolisetty

Please download profiler.7z.001 - profiler.7z.004 and uncompress them, then run:

cd profiler/build
./profiler