Thanks - I have already seen and followed these resources to get to this point.
My issue is specifically INT8 speed relative to FP16, whereas all of these materials report speed-ups for INT8 against FP32.
My query concerns FP16 vs. INT8 for a mainstream architecture (YoloV4), on two GPU platforms where I see the same behaviour (Xavier NX and RTX 2080 Ti).
Can you advise whether what I am observing is normal and expected?
Thanks @spolisetty - my impression from all the documentation was that INT8 quantisation forces every layer to INT8, at the cost of accuracy, which depends on how well the distribution (dynamic range) of each INT8-quantised layer approximates that of the original FP32 layer.
From what you are saying, you imply that some layers will remain at FP32 (or FP16 if selected) when INT8 quantisation is a poor approximation of the original FP32, in which case the network performs inference in a mix of FP32 and INT8. Where is this discussed in the documentation? I seem to have missed it.
If this is the case:
How do I easily iterate over the final TRT network to tell which layers run in FP32, INT8, etc.?
How do I control the criteria for the decision “INT8 is a poor approximation of the original FP32 for this layer → don’t use INT8”, and hence force INT8 quantisation for additional/all layers?
I have not seen either concept in the various samples.
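To make question 1 concrete, here is the kind of inspection I was hoping for - a minimal sketch built around the TensorRT engine-inspector JSON dump (TensorRT >= 8.2). The `"Layers"`/`"Precision"` field names and the sample dump below are my assumptions about the inspector's output schema, not verified against a real engine:

```python
import json
from collections import Counter

def precision_histogram(inspector_json: str) -> Counter:
    """Tally layers by execution precision from an engine-inspector dump.

    ASSUMPTION: the "Layers" / "Precision" field names are my guess at the
    inspector's JSON schema, not verified against a real dump.
    """
    info = json.loads(inspector_json)
    return Counter(layer.get("Precision", "Unknown")
                   for layer in info.get("Layers", []))

# With a built engine, the dump would come from (TensorRT >= 8.2):
#   inspector = engine.create_engine_inspector()
#   dump = inspector.get_engine_information(trt.LayerInformationFormat.JSON)
#
# For question 2, forcing a layer to INT8 at build time would look like:
#   network.get_layer(i).precision = trt.DataType.INT8
#   config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
# (older TensorRT versions used trt.BuilderFlag.STRICT_TYPES instead).

# Hypothetical sample dump, just to exercise the helper:
sample = json.dumps({"Layers": [
    {"Name": "conv1", "Precision": "INT8"},
    {"Name": "yolo_out", "Precision": "FP32"},
]})
print(precision_histogram(sample))
```

If something like this is the intended workflow, a pointer to where the docs cover it would be much appreciated.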