The threshold of quantization

Hi,

(1) If the distribution of FP32 values is asymmetric (Figure 1), should TensorRT's dynamic range still be symmetric, with thresholds of the form -T1 and T1?

(2) If the distribution of FP32 values is symmetric but not centered at 0 (Figure 2), should the range still be symmetric, i.e. -T2 and T2?

(3) If the answer to both of the above cases is yes, won't significant accuracy be lost after quantization?

When the sampleINT8 MNIST sample was run, the following CalibrationTable file was generated:
TRT-6001-EntropyCalibration2
data: 3c008912
conv1: 3c88edfc
pool1: 3c88edfc
conv2: 3ddc858b
pool2: 3ddc858b
(Unnamed Layer* 4) [Fully Connected]_output: 3dc6a9ed
ip1: 3db6bd6e
ip2: 3e691968
prob: 3c010a14
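
As a side note, the per-tensor values in the calibration cache are IEEE-754 float32 scale factors written in hex. A quick way to inspect them, using only the Python standard library (no TensorRT required):

```python
import struct

def decode_scale(hex_str):
    """Interpret a calibration-cache entry as a big-endian IEEE-754 float32."""
    return struct.unpack("!f", bytes.fromhex(hex_str))[0]

# A few entries from the CalibrationTable above
for name, hex_scale in [("data", "3c008912"), ("conv1", "3c88edfc"), ("prob", "3c010a14")]:
    print(f"{name}: scale = {decode_scale(hex_scale):.6f}")
```

For example, `data: 3c008912` decodes to a scale of roughly 0.00785.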

(4) Were all layers actually processed (calibrated), or only the layers listed above?
(5) If only some layers were selected for processing, were they selected manually (by configuration) or automatically by TensorRT?

For (1), (2) and (3): TensorRT’s range will always be symmetric, and it is possible that this results in a loss of information. This is discussed a bit more here: https://devtalk.nvidia.com/default/topic/1067253/int8-calibration-calculation-of-kullback-leiber-divergence/.

There are two workarounds in this scenario. You can manually set the min/max range if you know the expected values (https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/c_api/classnvinfer1_1_1_i_tensor.html#a956f662b1d2ebe7ba3aba3391aedddf5), though I believe this will still produce a symmetric range based on the min/max values you provide. Alternatively, you can use quantization-aware training (QAT) when training your model, and then convert the model to TensorRT with the EXPLICIT_PRECISION flag: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#work-with-qat-networks
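
To make the caveat about the first workaround concrete, here is a small sketch of how a manually supplied [min, max] is, as far as I understand, symmetrized before the INT8 scale is computed: the scale is derived from max(|min|, |max|), not from the asymmetric interval itself. This is an illustrative model of the behavior, not TensorRT's actual internal code:

```python
def scale_from_dynamic_range(range_min, range_max, num_levels=127):
    """Derive an INT8 scale from a user-provided dynamic range,
    assuming the range is symmetrized around zero first."""
    threshold = max(abs(range_min), abs(range_max))
    return threshold / num_levels

# Activations known to lie in [0.0, 10.0] end up with the same scale
# as a distribution spanning [-10.0, 10.0]:
print(scale_from_dynamic_range(0.0, 10.0))
print(scale_from_dynamic_range(-10.0, 10.0))
```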

For (4) and (5): calibration scales are generated for the layers that TensorRT chose to convert to INT8. A layer may not be chosen either because there is no supported INT8 implementation for it, or because TensorRT’s timing tactics determined that the INT8 implementation was actually slower than the FP16 or FP32 implementation in this case. More information about why tactics are or are not chosen can be seen by setting TensorRT’s logger severity to VERBOSE: https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/c_api/classnvinfer1_1_1_i_logger.html