# Low precision support in NVDLA
Using low precision, such as 8-bit, 4-bit, or even fewer bits, for inference is one of the optimization methods used in deep learning. It compresses the model, reducing its memory footprint, and improves performance at the cost of a small degradation in accuracy. Using INT8 precision for inference requires quantizing pre-trained models from floating point to INT8 and programming the converters in NVDLA to scale and re-scale tensors.
### NVDLA architecture for INT8 precision support includes the following:
- INT8 input/output data read/write
- 32-bit internal pipeline, which avoids saturation in mathematical computations
- Per-tensor input scaling using input converters
- Per-tensor and per-kernel output re-scaling using output converters
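The features listed above can be illustrated with a small software model. This is only a sketch of the arithmetic, not NVDLA's implementation: the helper names (`quantize`, `int8_dot`, `rescale_output`) are hypothetical, scales are assumed to be symmetric (real value ≈ quantized value × scale), and the output re-scaling is done in floating point for readability.

```python
INT8_MIN, INT8_MAX = -128, 127

def saturate_int8(value):
    """Clamp an integer to the INT8 range."""
    return max(INT8_MIN, min(INT8_MAX, value))

def quantize(values, scale):
    """Per-tensor input scaling: one scale for the whole tensor (assumed symmetric)."""
    return [saturate_int8(round(v / scale)) for v in values]

def int8_dot(q_inputs, q_weights):
    """Multiply INT8 operands and accumulate in a wide integer.

    A single product can reach 127 * 127 = 16129, far outside INT8,
    which is why the accumulator must be wider than the operands.
    """
    acc = 0
    for a, w in zip(q_inputs, q_weights):
        acc += a * w
    return acc

def rescale_output(acc, in_scale, w_scale, out_scale):
    """Output re-scaling: map the wide accumulator back to INT8.

    w_scale may differ per kernel, so the combined factor can be
    per-tensor or per-kernel.
    """
    return saturate_int8(round(acc * in_scale * w_scale / out_scale))

# Example with power-of-two scales so the arithmetic is exact.
q_in = quantize([0.5, -1.0, 0.25], 1 / 64)
q_w = quantize([1.0, 0.5, -0.5], 1 / 64)
acc = int8_dot(q_in, q_w)
q_out = rescale_output(acc, 1 / 64, 1 / 64, 1 / 64)
```

Here the real-valued dot product is -0.125, and `q_out * (1 / 64)` recovers the same value, while the intermediate accumulator (-512) would not fit in INT8.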
### Steps to generate INT8 quantized model:
- Analyze the dynamic range of per-layer tensors and calculate scale factors using TensorRT
- Import scale factors generated using TensorRT to NVDLA JSON format
- Quantize model weights and determine the converter parameters using scale factors
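The last step can be sketched as follows. This is a minimal illustration under stated assumptions, not the NVDLA compiler's actual procedure: all function names are hypothetical, weight scales are derived per kernel from the largest weight magnitude, and the converter is assumed to be programmed as an integer multiplier plus a right shift with a 16-bit multiplier width (the real field widths should be checked against the hardware documentation).

```python
def per_kernel_weight_scales(kernels):
    """One scale per output kernel: the largest magnitude maps to 127."""
    scales = []
    for kernel in kernels:
        max_abs = max(abs(w) for w in kernel)
        # An all-zero kernel gets scale 1.0 to avoid dividing by zero.
        scales.append(max_abs / 127.0 if max_abs > 0 else 1.0)
    return scales

def quantize_kernel(kernel, scale):
    """Quantize one kernel's weights to INT8 with its own scale."""
    return [max(-128, min(127, round(w / scale))) for w in kernel]

def output_rescale_factors(in_scale, weight_scales, out_scale):
    """Real-valued factor that maps each kernel's accumulator to the output scale."""
    return [in_scale * ws / out_scale for ws in weight_scales]

def to_multiplier_and_shift(factor, bits=16):
    """Approximate a positive factor as multiplier / 2**shift.

    The multiplier is kept below 2**(bits - 1); the width is an assumption.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    limit = 1 << (bits - 1)
    shift = 0
    while factor * (1 << (shift + 1)) < limit and shift < 31:
        shift += 1
    multiplier = min(round(factor * (1 << shift)), limit - 1)
    return multiplier, shift

kernels = [[0.5, -1.27], [0.0, 0.254]]
w_scales = per_kernel_weight_scales(kernels)
q_kernels = [quantize_kernel(k, s) for k, s in zip(kernels, w_scales)]
factors = output_rescale_factors(0.02, w_scales, 0.05)
converter_params = [to_multiplier_and_shift(f) for f in factors]
```

Using the largest usable shift keeps as many significant bits as possible in the multiplier, which limits the error introduced when a real-valued factor is replaced by integer arithmetic.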
#### Analyze dynamic range of per-layer tensors and calculate scale factors using TensorRT
A calibration tool collects the dynamic range of the output tensor for each layer over a dataset of images. This dynamic range information can be used to calculate per-tensor scale factors. For NVDLA, the TensorRT calibration interface is used to generate the scale factors.
Refer to https://github.com/NVIDIA/TensorRT/tree/release/5.1/samples/opensource/sampleINT8 for a sample application that explains how to use TensorRT to generate scale factors.
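To make the relationship between dynamic range and scale factor concrete, here is a minimal sketch that does not use TensorRT at all. It assumes a simple max-absolute-value range and a symmetric INT8 mapping; TensorRT's calibrators may choose the range differently (for example, by clipping outliers), and the function names are hypothetical.

```python
def dynamic_range(batches):
    """Largest absolute value of a layer's output seen over all calibration batches."""
    return max(abs(v) for batch in batches for v in batch)

def scale_from_range(max_abs, qmax=127):
    """Per-tensor scale so that max_abs maps to the largest INT8 magnitude."""
    return max_abs / qmax

# Outputs of one layer for two calibration batches (made-up numbers).
layer_outputs = [[0.1, -2.0], [1.5, 0.3]]
layer_range = dynamic_range(layer_outputs)
layer_scale = scale_from_range(layer_range)
```

Repeating this for every layer yields one scale factor per tensor, which is the information the later steps consume.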