INT8 vs FP16 results

eyalhir74 · October 28, 2020, 5:45am

Hi,
I’ve been using the method described in the article below to run our network in INT8 instead of FP16. The speedup is really cool, and the visual results (i.e. after I process the network output and visualize what’s needed) seem to be OK.
However, when I compare the numerical results of the FP16 and INT8 networks, I see big differences. The overall pattern of the numbers seems to be preserved, i.e., if the FP16 results contain the following sequence starting from the Xth position: -1.5, 0.34, 0.51, 3.4, -1.7
I’d see a similar sequence in the INT8 results, but somewhat shifted or scaled, such as: -5.2, 0.56, 4.53, -5.1

Is that reasonable? What explains this difference?
Can anyone clarify what exactly happens when the dynamic ranges are set?

https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleINT8API

thanks
Eyal

AastaLLL · October 28, 2020, 9:51am

Hi,

Calibration is required for the INT8 inference.
Have you generated the INT8 cache based on your model and device first?

Thanks.

eyalhir74 · October 28, 2020, 10:01am

Hi @AastaLLL,
This is what the example says:

One way to choose the dynamic range is to use the TensorRT INT8 calibrator. But if you don’t want to go that route (for example, let’s say you used quantization-aware training or you just want to use the min and max tensor values seen during training), you can skip the INT8 calibration and set custom per-network tensor dynamic ranges… Configuring the builder to use INT8 without the INT8 calibrator…

So I didn’t do a calibration. These are the steps I followed:

1. Take a couple of RGB input pictures.
2. Get a histogram of the output values of each layer in FP16.
3. Take the min/max values that cover ~99% of the output values and use them in the setDynamicRange call for each layer.
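
The range-selection step above can be sketched in plain Python. This is only an illustration under assumptions: the helper names `percentile_range` and `symmetric_bound` are hypothetical, the activations are synthetic random numbers rather than real layer outputs, and the sketch assumes the larger absolute bound is what ends up being used, since TensorRT documents that setDynamicRange currently supports only symmetric ranges.

```python
import random

def percentile_range(values, coverage=0.99):
    """Return (lo, hi) that keep the central `coverage` fraction of the values."""
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - coverage) / 2.0      # fraction discarded on each side
    lo_idx = int(tail * (n - 1))       # floor index, a nearest-rank approximation
    hi_idx = (n - 1) - lo_idx
    return ordered[lo_idx], ordered[hi_idx]

def symmetric_bound(lo, hi):
    """Collapse an asymmetric (lo, hi) pair into one symmetric bound."""
    return max(abs(lo), abs(hi))

# Synthetic stand-in for one layer's FP16 outputs (hypothetical data).
random.seed(0)
activations = [random.gauss(0.0, 3.0) for _ in range(10000)]
activations += [250.0, -180.0]         # a few extreme outliers

lo, hi = percentile_range(activations, coverage=0.99)
bound = symmetric_bound(lo, hi)
print(f"99% range: [{lo:.2f}, {hi:.2f}] -> symmetric bound {bound:.2f}")
# The pair (-bound, bound) is what would be passed to setDynamicRange.
```

Note that any value outside the chosen range saturates after quantization, so a range derived from only a couple of images may clip activations that other inputs produce.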

The .onnx file is the one used for FP16 - I didn’t touch or retrain it.

Maybe I misunderstood something in the process?

thanks
Eyal

AastaLLL · October 29, 2020, 4:27am

Hi,

The dynamic range is used by the quantization step to convert floating-point values into integers.

Please note that the dynamic range of float32 (-3.4x10^38 ~ +3.4x10^38) is much larger than that of int8 (-128 ~ +127).
So it’s important to select the correct dynamic range.
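
To make the role of the dynamic range concrete, here is a minimal Python sketch of symmetric INT8 quantization. It is an illustration under assumptions, not TensorRT's actual implementation: the function names are hypothetical, the scale is assumed to be the range bound divided by 127, and values are clamped to the int8 limits mentioned above.

```python
def quantize_int8(x, dyn_range):
    """Map a float to an int8 code using a symmetric range [-dyn_range, dyn_range]."""
    scale = dyn_range / 127.0          # float value represented by one integer step
    q = round(x / scale)               # Python's round() rounds ties to even
    return max(-128, min(127, q))      # saturate to the int8 limits

def dequantize_int8(q, dyn_range):
    """Map an int8 code back to an approximate float value."""
    return q * (dyn_range / 127.0)

dyn_range = 10.0                       # e.g. a layer configured as -10 to +10
for x in (-1.5, 0.34, 3.4, 42.0):
    q = quantize_int8(x, dyn_range)
    print(x, "->", q, "->", round(dequantize_int8(q, dyn_range), 3))
```

In this sketch, values inside the range come back with an error of at most half a step (about 0.04 for a range of 10), while 42.0 saturates at the top of the range. A range that is too narrow therefore clips large activations, and a range that is too wide wastes resolution on small ones.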

The default range is set based on some general classification model.
If your input data differs from that assumption, you can try calibration to correct the quantization.

Here is a related talk for your reference:

Thanks.

eyalhir74 · October 29, 2020, 4:50am

Hi @AastaLLL,
Thanks for the prompt answer. I’m converting from FP16, but I do realize the difference between the FP16 and INT8 ranges.
Based on analyzing each layer’s FP16 output, I believe I set the dynamic range in a reasonable way - usually -10 to +10, and in some layers -50 to +50. The results seem reasonable.

However, there is a discrepancy in the value range of the final network output.
I thought there might be something I neglected or did wrong…
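
One way to check whether the INT8 output is simply a rescaled copy of the FP16 output is to compute the cosine similarity and the least-squares scale factor between the two. A minimal Python sketch follows; the function names are hypothetical and the two vectors are made-up placeholders, not values from this network.

```python
import math

def cosine_similarity(a, b):
    """Close to 1.0 when the two vectors point in the same direction, whatever their scale."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_fit_scale(ref, test):
    """Scale k that minimizes sum((test - k * ref)^2)."""
    return sum(r * t for r, t in zip(ref, test)) / sum(r * r for r in ref)

# Placeholder outputs (hypothetical): the second is the first times two plus small noise.
fp16_out = [-1.5, 0.34, 0.51, 3.4, -1.7]
int8_out = [-3.1, 0.70, 1.00, 6.7, -3.5]

print("cosine similarity:", cosine_similarity(fp16_out, int8_out))
print("best-fit scale:   ", best_fit_scale(fp16_out, int8_out))
```

A cosine similarity near 1 with a scale far from 1 would point to a scaling problem, for example in the dynamic range of the output tensor, while a low similarity would suggest clipping or precision loss in intermediate layers.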

I’ll check the talk you mentioned.

thanks
Eyal