Bias Layers Quantization in NVDLA

Background

I’m trying to run a quantized object detection model (tiny-YOLO-v2) on a chip with the NVDLA-small configuration (no floating point, int8 only). This model has convolution layers with bias. The results are incorrect, and there appears to be an overflow-related issue in the bias layers that follow the convolutions.

Description

I have a few questions about the quantization of bias layers in NVDLA that I couldn’t find answers to in the documentation.

  1. NVDLA computes the intermediate tensor output (accumulation) of a convolution layer as int32, but bias values are stored as int16. There must be a step where these int16 biases are added to the int32 accumulations. My understanding is that the int16 values are sign-extended (up-cast) to int32 and then added to the accumulation; is that correct?

  2. The NVDLA compiler expects me to provide an output scale for each layer. I’m using the exact same scale for a convolution layer and its bias layer. Is that the expected input, or should I provide different scales for the conv and its bias?

  3. In the conversion from the int32 accumulation to int8 (to be fed to the next layer), a scale is multiplied into the accumulation, and that scale is generally a floating-point number. How is this done on NVDLA-small, which supposedly has no floating-point capability?
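For context on questions 1 and 3, here is a minimal sketch of how I currently understand the bias-add and int32→int8 conversion to work. The function name and the `mul_bits` parameter are my own; the key assumption is that the compiler folds the floating-point scale into an integer multiplier plus an arithmetic right shift, so no FP hardware is needed at inference time:

```python
import numpy as np

def requantize(acc_i32, bias_i16, scale_fp, mul_bits=16):
    """Hypothetical sketch of the bias add + int8 conversion.

    acc_i32  : int32 convolution accumulation
    bias_i16 : int16 bias value
    scale_fp : floating-point output scale computed offline by the compiler
    """
    # 1. Sign-extend the int16 bias to int32 and add it to the accumulation
    #    (here widened to int64 only so the Python model itself cannot wrap).
    acc = np.int64(np.int32(acc_i32)) + np.int64(np.int16(bias_i16))

    # 2. Offline, the compiler approximates the float scale as an integer
    #    multiplier over a power of two: scale_fp ~= mul / 2**mul_bits.
    mul = int(round(scale_fp * (1 << mul_bits)))

    # 3. On device: integer multiply, arithmetic right shift, saturate to int8.
    out = (acc * mul) >> mul_bits
    return int(np.clip(out, -128, 127))
```

For example, `requantize(1000, 24, 0.01)` models an accumulation of 1000 plus a bias of 24, rescaled by 0.01 and saturated into int8 range. This is only how I picture the datapath; please correct me if NVDLA’s converter works differently.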

Finally, does anything in the setup I’ve described above look like a problem that could lead to overflow?

Thank you very much for guidance.

Best,
Yifan

Environments

Hardware: a custom NVDLA chip with RISC-V CPU

DNN framework: PyTorch (1.5)

Quantization framework: Intel Distiller

Hi, please refer to the link below on performing inference in INT8:
https://github.com/NVIDIA/TensorRT/blob/master/samples/opensource/sampleINT8/README.md

Thanks!