NVDLA INT8 Intermediate Layer Scaling


The NVDLA documentation doesn’t clearly describe how the scaling converters need to be programmed for INT8 quantized DNN inference. My question/confusion specifically is: how are the scales (i.e., the calibration table) computed for passing to the NVDLA compiler? The documentation recommends using TensorRT but doesn’t say exactly what the scale means. Here is my understanding. Consider:

quantizedLayerInput = S1 * Input
quantizedWeights = S2 * W
resultTensor = S1 * S2 * R                        // R = W * Input in fp32
INT8ResultTensor = resultTensor * S3 / (S1 * S2)  // S3 computed from the layer output distribution
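To make my understanding concrete, here is a toy NumPy sketch of that chain (the tensor values are made up, and I'm assuming the convention above where a scale S multiplies fp32 to produce int8, i.e., S = 127/max):

```python
import numpy as np

def quantize(x, s):
    # Post's convention: quantized = round(S * float), saturated to +/-127
    return np.clip(np.round(x * s), -127, 127).astype(np.int8)

# Made-up fp32 input and weights for a toy 1-output dense layer
inp = np.array([0.5, -1.0, 0.25], dtype=np.float32)
w   = np.array([[0.1, 0.2, -0.3]], dtype=np.float32)

S1 = 127.0 / np.abs(inp).max()   # input scale
S2 = 127.0 / np.abs(w).max()     # weight scale

q_in = quantize(inp, S1)         # quantizedLayerInput = S1 * Input
q_w  = quantize(w, S2)           # quantizedWeights   = S2 * W

# int32 accumulator: this is (approximately) S1 * S2 * R
acc = q_w.astype(np.int32) @ q_in.astype(np.int32)

R  = w @ inp                     # fp32 reference result
S3 = 127.0 / np.abs(R).max()     # output scale, from the output distribution

# Requantize with S3 / (S1 * S2) -- the ratio asked about above
q_out = np.clip(np.round(acc * (S3 / (S1 * S2))), -127, 127).astype(np.int8)
q_ref = quantize(R, S3)          # quantizing the fp32 result directly
```

Here `q_out` (requantized from the int32 accumulator) and `q_ref` (quantized directly from fp32) agree up to rounding.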

Each scale is computed as the following:

S_dist = 256 / (dist_max - dist_min)

If this understanding is correct, the scale passed to the NVDLA compiler should be:

S3 / (S1 * S2)

Guidance is very much appreciated.


scale = max(abs(min), abs(max)) / 127
1.0/scale is the number of int8 steps that one unit of floating point maps to.
int8 = round(float / scale)
If abs(value) exceeds the int8 range, it is saturated (clamped) rather than wrapped.
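As a minimal sketch of that recipe (tensor values are made up for illustration):

```python
import numpy as np

def calibration_scale(t):
    # scale = max(abs(min), abs(max)) / 127
    return max(abs(float(t.min())), abs(float(t.max()))) / 127.0

def to_int8(t, scale):
    # int8 = round(float / scale), saturated to the int8 range
    return np.clip(np.round(t / scale), -128, 127).astype(np.int8)

t = np.array([-0.4, 0.25, 1.0], dtype=np.float32)
s = calibration_scale(t)          # 1.0 / 127 for this tensor
q = to_int8(t, s)                 # the max-magnitude element maps to 127

# A value outside the calibrated range saturates instead of wrapping:
clipped = to_int8(np.array([2.0], dtype=np.float32), s)
```

The saturation at the end is the "truncation" mentioned above: 2.0 / s would be ~254, which gets clamped to 127.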

For example, interpreting the HEX value (3f556f06) in the calibration table as a big-endian fp32:

Up_sample_6/conv2d_25/Relu: 3f556f06  →  scale ≈ 0.833725333

so the fp32 dynamic-range maximum for that tensor is 127 * 0.833725333 ≈ 105.88.
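That decoding can be reproduced with Python's struct module (reading the hex entry as a big-endian fp32):

```python
import struct

raw = bytes.fromhex("3f556f06")        # entry from the calibration table
scale = struct.unpack(">f", raw)[0]    # decode as big-endian fp32

print(scale)        # ≈ 0.8337253
print(127 * scale)  # fp32 dynamic-range max, ≈ 105.88
```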

Thanks very much for your reply. This makes sense; I also realized that the scale has to be computed this way to enable zero-centered (symmetric) quantization. The other related question is how the scale is used for the intermediate output of each layer (conv, dense) in NVDLA. There are three scales at hand: scale_weights, scale_inputs, scale_outputs. My understanding is that converting the INT32/FLOAT intermediate output of weights * inputs (both already quantized to INT8) back to INT8 requires the output to be rescaled by:

scale_outputs / (scale_inputs * scale_weights).

Does that make sense?
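One way to sanity-check this is to run a toy layer both ways. Note that the direction of the ratio depends on which convention "scale" follows: with the calibration convention from the earlier reply (scale = max/127, so int8 = float/scale), the factor applied to the int32 accumulator works out to (scale_inputs * scale_weights) / scale_outputs, which is the same quantity as S3 / (S1 * S2) in the first post's notation where S = 127/max. A made-up example:

```python
import numpy as np

def calib_scale(t):
    return np.abs(t).max() / 127.0          # per-tensor scale = max(abs)/127

def to_int8(t, s):
    return np.clip(np.round(t / s), -128, 127).astype(np.int8)

x = np.array([0.4, -1.0, 0.25], dtype=np.float32)      # made-up activations
w = np.array([[0.2, 0.5, -0.125]], dtype=np.float32)   # made-up weights

s_in, s_w = calib_scale(x), calib_scale(w)

# int32 accumulation of the int8 operands
acc = to_int8(w, s_w).astype(np.int32) @ to_int8(x, s_in).astype(np.int32)

r = w @ x                        # fp32 reference output
s_out = calib_scale(r)

# Requantize the accumulator into the output's int8 domain.
# With scale = max/127, the factor is (s_in * s_w) / s_out.
requant = (s_in * s_w) / s_out
y_int8 = np.clip(np.round(acc * requant), -128, 127).astype(np.int8)

y_ref = to_int8(r, s_out)        # quantizing the fp32 output directly
# y_int8 and y_ref should agree up to rounding
```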

If this is int8 inference, the output will be int8.
If you want fp32 output, just multiply by the output scale.
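In other words (a sketch with made-up values, assuming output_scale follows the max/127 convention above):

```python
import numpy as np

y_int8 = np.array([-51, 32, 127], dtype=np.int8)   # made-up int8 layer output
output_scale = 0.0035531                           # made-up per-tensor scale (max/127)

y_fp32 = y_int8.astype(np.float32) * output_scale  # dequantize back to fp32
```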