GTC 2020: Integer Quantization for DNN Inference Acceleration

GTC 2020 S22075
Presenter: Patrick Judd, NVIDIA
Abstract
While neural networks are typically trained in floating-point formats, inference (serving) can often use integer arithmetic once the network has been quantized. Benefits of quantized inference include reduced memory requirements and access to faster math pipelines. For example, NVIDIA’s Tensor Cores provide int8, int4, and int1 math units with 4x, 8x, and 32x the math bandwidth of fp32, respectively. We’ll detail various options for quantizing a neural network for inference while maintaining model accuracy. We’ll review results for networks trained on a range of tasks (computer vision, language, speech) and with varying architectures (CNNs, RNNs, Transformers).
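The session surveys several quantization options; as a rough illustration of the core idea only (not the specific recipe presented in the talk), the sketch below applies symmetric per-tensor int8 quantization with a simple max-abs calibration. The helper names (`quantize_int8`, `dequantize`) and the use of NumPy are assumptions made for this example.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization using a max-abs range."""
    # Scale maps the largest-magnitude float value to the int8 limit (127).
    scale = float(np.max(np.abs(x))) / 127.0
    # Round to the nearest integer and clamp to the representable range.
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float tensor."""
    return q.astype(np.float32) * scale

# Example: quantize a random weight tensor and measure the rounding error.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"scale={scale:.6f}, max abs error={err:.6f}")
```

Since each quantized value differs from the original by at most half a quantization step, the maximum absolute error printed here should be close to `scale / 2`; per-channel scales or calibrated clipping ranges, among the options the talk discusses, can tighten this further.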
