GTC 2020: Toward INT8 Inference: Deploying Quantization-Aware Trained Networks using TensorRT

GTC 2020 S21664
Presenters: Dheeraj Peri, NVIDIA; Jhalak Patel, NVIDIA
Abstract
Accelerating deep neural networks (DNNs) is a critical step in realizing the benefits of AI for real-world use cases. The need to improve DNN inference latency has sparked interest in lower-precision formats such as FP16 and INT8, which offer faster inference. Two prevalent techniques for converting FP32 DNNs to INT8 are post-training quantization (PTQ) and quantization-aware training (QAT). TensorRT, a platform for high-performance deep learning inference, supports post-training quantization by performing calibration on the trained model, which quantizes the weights and activations. However, in some cases post-training quantization can degrade accuracy when converting an FP32 model to its INT8 counterpart. QAT achieves higher accuracy by inserting quantization ops that simulate lower-precision arithmetic during training. We'll describe how TensorRT optimizes these quantization ops and demonstrate an end-to-end workflow for running quantized networks.
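As a rough illustration of the two ideas above, the NumPy sketch below (hypothetical, not from the session) shows a simple max-abs calibration step standing in for TensorRT's PTQ calibration, which in practice uses an entropy-based method, and the quantize-dequantize ("fake quantization") op that QAT inserts into the training graph. The function names are illustrative only.

import numpy as np

def max_abs_scale(x: np.ndarray) -> float:
    """Pick a symmetric INT8 scale from observed activation values.
    A simplified stand-in for TensorRT's calibration step, which uses
    an entropy-based method rather than plain max-abs."""
    return float(np.max(np.abs(x))) / 127.0

def fake_quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Quantize to INT8 and immediately dequantize back to FP32.
    This quantize-dequantize pair is the kind of quantization op QAT
    inserts so that training is exposed to INT8 rounding error."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

# Toy example: derive a scale from sample activations ("calibration"),
# then simulate INT8 inference on new data.
rng = np.random.default_rng(0)
calibration_acts = rng.normal(size=4096).astype(np.float32)
scale = max_abs_scale(calibration_acts)

x = rng.normal(size=8).astype(np.float32)
print("fp32 values:   ", x)
print("int8-simulated:", fake_quantize(x, scale))

Because the fake-quantized values stay in FP32, the network can still be trained with ordinary backpropagation while learning weights that tolerate INT8 rounding.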
