TensorRT INT8 Quantization: weights + activations quantization

Hello everyone,

I am running INT8 quantization using TRT5 on top of TensorFlow.
In the INT8 quantization presentation, they mention that the activations are quantized using the Entropy Calibrator, while the weights are quantized using min-max quantization.

Question: are the weights of the whole graph (all trainable parameters: batch norm parameters + biases + kernel weights) taken into consideration, and then we just map the max to 127 and the min to -127?

If yes, can you please explain how this works when we have huge values for biases or batch norm parameters?
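For example (my own illustrative numbers): if my kernel weights lie in [-0.5, 0.5] but a batch norm parameter somewhere in the graph is 500, a single global scale of 127/500 would round every kernel weight to 0, destroying the layer.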

Thanks,
Fares

Hi,

There are two ways to enable the INT8 interface:

  1. Dynamic range - by setting the min and max values per layer (see the sketch after the link below)
  2. INT8 calibration - implement the IInt8Calibrator interface to provide calibration data to TensorRT

Please refer to the link below for more details:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt-515/tensorrt-developer-guide/index.html#enable_int8_c
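As a rough sketch of option 1, assuming the TensorRT 5 C++ API (`network` and `builder` are assumed to exist, and the ranges here are placeholders for values you would measure on your own data):

    // Sketch of option 1 (TensorRT 5 C++ API): set a dynamic range on every tensor.
    // The ranges below are placeholder values, not anything TensorRT prescribes.
    network->getInput(0)->setDynamicRange(-1.0f, 1.0f);  // network inputs need ranges too
    for (int i = 0; i < network->getNbLayers(); ++i)
    {
        nvinfer1::ILayer* layer = network->getLayer(i);
        for (int j = 0; j < layer->getNbOutputs(); ++j)
        {
            // setDynamicRange(min, max) gives TensorRT the expected range of
            // this tensor, from which the INT8 scale factor is derived.
            layer->getOutput(j)->setDynamicRange(-6.0f, 6.0f);
        }
    }
    builder->setInt8Mode(true);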

INT8 quantization is performed per layer. By default, TensorRT will choose an INT8 implementation only if it results in a higher-performance network; if an implementation at a higher precision is faster, TensorRT will use that instead.
You can override this behavior by making the type constraints strict:
builder->setStrictTypeConstraints(true);
https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt-515/tensorrt-developer-guide/index.html#set_layer_mp_c
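For reference, a minimal sketch of the corresponding builder setup for option 2 (TensorRT 5 C++ API; `calibrator` here is a hypothetical instance of your own IInt8Calibrator implementation):

    // Minimal INT8 builder setup (TensorRT 5 C++ API). `calibrator` is assumed
    // to be your own IInt8Calibrator implementation.
    builder->setInt8Mode(true);                // enable INT8 precision
    builder->setInt8Calibrator(calibrator);    // supply calibration data
    builder->setStrictTypeConstraints(true);   // forbid fallback to higher precision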

TRT quantizes both weights and activations to INT8 precision, but TRT 5.x does not accept pre-quantized weights as input from the user.

Thanks

Hello,

Thank you for your answer!

I am using TF-TRT, i.e., TensorRT on top of TensorFlow, just for testing, not at the C++ level.

My question is how the weights are quantized for INT8 quantization: are all the weights of the graph quantized together, or is it done layer by layer?

I am only asking about the weights, not the activations.

Thanks.

Hi,

During INT8 quantization, both weights and activations are quantized on a per-layer basis.
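As a rough illustration (my own C++ sketch, not TensorRT's internal code), per-layer symmetric min-max weight quantization computes a separate scale for each layer's weights:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Symmetric min-max INT8 quantization of one layer's weights: the largest
    // absolute weight in the layer maps to 127, everything else scales
    // proportionally. Each layer gets its own scale factor.
    std::vector<int8_t> quantizeLayerWeights(const std::vector<float>& weights)
    {
        float maxAbs = 0.0f;
        for (float w : weights)
            maxAbs = std::max(maxAbs, std::fabs(w));
        const float scale = (maxAbs > 0.0f) ? 127.0f / maxAbs : 1.0f;

        std::vector<int8_t> quantized;
        quantized.reserve(weights.size());
        for (float w : weights)
            quantized.push_back(static_cast<int8_t>(std::lround(w * scale)));
        return quantized;
    }

Because the scale is computed per layer, a layer whose weights lie in [-0.5, 0.5] keeps its full INT8 resolution even if parameters elsewhere in the graph are in the hundreds, which addresses the earlier concern about huge bias or batch norm values.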

Thanks

Is the quantization then hardware/OS specific, or is it portable?