YoloV4 slower in INT8 than FP16

Description

Building my custom YoloV4 608x608 model in INT8 is slower than in FP16 on both my Xavier NX and my 2080 Ti.

For example on the 2080Ti I get:
FP16: 13ms per frame
INT8: 19ms per frame

Varying aspects of the INT8 calibration, etc., makes no difference to the speed.

Is this normal?

Environment

TensorRT Version: 7.2.1.6
GPU Type: RTX 2080Ti
Nvidia Driver Version: 440.82
CUDA Version: 10.2
CUDNN Version: 8.2.0.53
Operating System + Version: OpenSUSE
Python Version (if applicable): N/A (c++)
TensorFlow Version (if applicable): N/A (c++)
PyTorch Version (if applicable): N/A (c++)
Baremetal or Container (if container which image + tag):

Alt. Environment: Xavier NX, JetPack 4.4 - all default versions

Hi, please refer to the link below to perform inference in INT8:
https://github.com/NVIDIA/TensorRT/blob/master/samples/opensource/sampleINT8/README.md

Thanks!
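
For context, the sample linked above builds the INT8 engine around a calibrator. A bare skeleton of that interface, using IInt8EntropyCalibrator2 with the data-loading and caching parts left out, looks roughly like this (the class name here is just a placeholder, not taken from the sample):

```cpp
#include "NvInfer.h"
#include <cstddef>

// Minimal calibrator skeleton; real code must copy calibration batches
// to device memory in getBatch() and should cache the scales on disk.
class Int8Calibrator : public nvinfer1::IInt8EntropyCalibrator2
{
public:
    int getBatchSize() const override { return 1; }

    bool getBatch(void* bindings[], const char* names[], int nbBindings) override
    {
        // Fill bindings[i] with a device pointer to the next calibration batch
        // for input tensor names[i]; return false when no batches are left.
        return false;
    }

    const void* readCalibrationCache(std::size_t& length) override
    {
        length = 0;          // no cached calibration table in this skeleton
        return nullptr;
    }

    void writeCalibrationCache(const void* cache, std::size_t length) override
    {
        // Persist the calibration table here to avoid re-calibrating on the next build.
    }
};
```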

Thanks - I have seen these resources and already followed them on my journey to this point.

I am dealing with a specific issue of INT8 speed, and all of these materials report speed-ups for INT8 over FP32.
My query is specifically about FP16 vs. INT8 for a specific yet mainstream architecture (YoloV4), on two GPU platforms where I see the same behaviour (Xavier NX and RTX 2080 Ti).

Can you advise whether what I am observing is normal and expected?

Thanks.

Hi @toby.breckon,

It’s possible if many layers end up falling back to FP32. You’d probably want to enable both INT8 and FP16 in that case.

Thank you.
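
For what it’s worth, here is a rough sketch of enabling both precisions with the TensorRT 7 C++ builder API (buildInt8Fp16Engine is just an illustrative helper name, and it assumes you already have a builder, a network and an IInt8Calibrator):

```cpp
#include "NvInfer.h"

// Build with both INT8 and FP16 enabled so that layers without a fast INT8
// implementation can fall back to FP16 instead of FP32.
nvinfer1::ICudaEngine* buildInt8Fp16Engine(nvinfer1::IBuilder* builder,
                                           nvinfer1::INetworkDefinition* network,
                                           nvinfer1::IInt8Calibrator* calibrator)
{
    nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1ULL << 30);    // 1 GiB workspace, adjust as needed

    if (builder->platformHasFastInt8())
    {
        config->setFlag(nvinfer1::BuilderFlag::kINT8);
        config->setInt8Calibrator(calibrator);  // your existing calibrator
    }
    if (builder->platformHasFastFp16())
    {
        config->setFlag(nvinfer1::BuilderFlag::kFP16);
    }

    nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    config->destroy();
    return engine;
}
```

With both flags set, TensorRT can choose per layer between INT8, FP16 and FP32 kernels, rather than dropping all the way back to FP32 when a layer has no fast INT8 implementation.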

Thanks @spolisetty - my impression from all the documentation was that INT8 quantisation forces all layers to INT8, at the expense of accuracy, which depends on how well the distribution (dynamic range) of the INT8-quantised layers approximates that of the original FP32 layers.

From what you are saying, you imply that some layers will remain at FP32 (or FP16 if selected) if INT8 quantisation is a poor approximation to the original FP32. In that case the network would perform inference in a mix of FP32 and INT8. Where is this discussed in the documentation? I seem to have missed it.

If this is the case:

  1. How do I easily cycle through the final TRT network to tell which layers are FP32, INT8, etc.?
  2. How do I control the criteria for the decision “INT8 is a poor approximation to the original FP32 for this layer → don’t use INT8”, and hence force INT8 quantisation for additional/all layers?

I have not seen either concept in the various samples.

Many thanks,

T.

Hi @toby.breckon,

(1) This should be part of the verbose logs in the builder.
(2) The decision is purely based on performance, not accuracy. The user can force INT8 by enabling the STRICT_TYPES builder flag or by using an algorithm selector.
For your reference,
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleAlgorithmSelector

Thank you.
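
As a rough illustration of both points (class and helper names below are placeholders, assuming the TensorRT 7 C++ API where the flag is still named kSTRICT_TYPES): the per-layer precision choices show up when you build with a logger that forwards kVERBOSE messages, and INT8 can be requested per layer and made binding with the strict-types flag:

```cpp
#include "NvInfer.h"
#include <iostream>

// (1) Forward everything, including kVERBOSE, so the builder prints its
//     per-layer precision and tactic choices during engine construction.
class VerboseLogger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, const char* msg) override
    {
        (void)severity;                  // print all messages, including verbose ones
        std::cout << msg << std::endl;
    }
};

// (2) Request INT8 for every layer and make the request binding with
//     kSTRICT_TYPES, instead of letting the builder pick the fastest precision.
void forceInt8Everywhere(nvinfer1::INetworkDefinition* network,
                         nvinfer1::IBuilderConfig* config)
{
    config->setFlag(nvinfer1::BuilderFlag::kINT8);
    config->setFlag(nvinfer1::BuilderFlag::kSTRICT_TYPES);

    for (int i = 0; i < network->getNbLayers(); ++i)
    {
        network->getLayer(i)->setPrecision(nvinfer1::DataType::kINT8);
    }
}
```

Pass an instance of the logger to createInferBuilder(...) when you create the builder. Note that with kSTRICT_TYPES a layer that has no INT8 implementation will fail the build rather than silently run in higher precision, so in practice you may want to call setPrecision only on the layers that the verbose log shows running in FP32.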