I’m trying to run a quantized object detection model (tiny-YOLO-v2) on a chip with the NVDLA-small configuration (no floating point, int8 only). The model has convolution layers with bias. The results are incorrect, and there appears to be an overflow-related issue in the bias addition that follows the convolutions.
I have a few questions about how NVDLA quantizes the bias that I couldn’t find answers to in the documentation.
NVDLA computes the intermediate convolution output (the accumulation) as int32, but bias values are stored as int16. There must be a step where these int16 biases are added to the int32 accumulations; my understanding is that the int16 values are sign-extended (up-cast) to int32 and then added to the accumulation. Is that correct?
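To make sure I’m being clear, here is a minimal sketch of what I assume the hardware does (the variable names and example values are mine, not anything from the NVDLA spec):

```python
import numpy as np

# Assumed behavior: the int16 bias is sign-extended to int32 and the
# addition happens entirely in int32, alongside the conv accumulator.
acc_int32 = np.array([2_000_000_000, -5_000], dtype=np.int32)  # conv accumulations
bias_int16 = np.array([30_000, -100], dtype=np.int16)          # stored biases

# Sign-extend int16 -> int32, then add in int32.
result = acc_int32 + bias_int16.astype(np.int32)
# result stays int32: [2_000_030_000, -5_100]
```

If instead the addition happens at a different precision, or the bias is shifted/aligned before the add, that could explain what I’m seeing.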
The NVDLA compiler expects an output scale for each layer, which I must provide. I’m using exactly the same scale for a convolution layer and its bias. Is that the expected input, or should the conv output and its bias have different scales?
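Concretely, this is roughly what I’m doing today (the names and numbers are mine for illustration, not NVDLA/Distiller API):

```python
import numpy as np

# My current approach: quantize the float bias with the *same* scale
# I hand the compiler as the conv layer's output scale.
conv_output_scale = 0.05          # example per-layer scale I provide
bias_fp = np.array([0.7, -1.2])   # float biases from the trained model

bias_int16 = np.clip(np.round(bias_fp / conv_output_scale),
                     -32768, 32767).astype(np.int16)
# -> [14, -24]
```

If the hardware instead expects the bias in the accumulator’s scale (e.g. input scale times weight scale, as some int8 schemes do), then reusing the output scale here would be wrong, and that mismatch is one of my overflow suspects.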
In the conversion from the int32 accumulation to int8 (to feed the next layer), the accumulation is multiplied by a scale, which is in general a floating-point number. How is that done on NVDLA-small, which supposedly has no FP capability?
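My guess is that the compiler approximates the float scale as an integer multiplier plus a right shift, so the rescale can run in pure integer arithmetic. A sketch of that guess (all names and values hypothetical):

```python
import numpy as np

# Approximate a float scale as (multiplier / 2**shift), then rescale
# with an integer multiply and arithmetic right shift.
scale = 0.00392                                 # example requantization factor
shift = 16
multiplier = int(round(scale * (1 << shift)))   # 257

acc = np.array([10_000, -20_000], dtype=np.int64)  # widened accumulations
out = (acc * multiplier) >> shift                  # integer multiply + shift
out_int8 = np.clip(out, -128, 127).astype(np.int8)
```

If that is indeed how it works, I’d also like to know where the multiplier/shift pair comes from (computed by the compiler from my scale, or something I must supply), since a bad shift choice could overflow the intermediate product.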
Finally, does anything in what I’ve described above look like a mistake that could lead to overflow?
Thank you very much for any guidance.
Hardware: a custom NVDLA chip with RISC-V CPU
DNN framework: PyTorch (1.5)
Quantization framework: Intel Distiller