I am quite confused about the design concept of the Q/DQ nodes during QAT. Take the picture below as an example: since TensorRT already has an int8 conv implementation,
why can't we quantize the weights and activations to int8 and then use the int8 conv op directly during explicit quantization?
And why must we use DQ to restore precision? Taking the number 3 as an example, does that mean we don't use the integer 3 but the float 3.0 instead? What is the meaning of the DQ node?
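To make the question concrete, here is a tiny sketch of what I understand Q and DQ to mean (the scale value is made up, just for illustration):

```python
scale = 1.0                      # made-up per-tensor scale, only for illustration

x = 3.0                          # original float value
q = int(round(x / scale))        # Q node: float -> int8 representation (3)
dq = q * scale                   # DQ node: int8 -> float again (3.0)
print(q, dq)                     # 3 3.0
```

Is the Q/DQ pair really just this round trip? If so, why is it needed at all?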
And one more question: why is the int8 implementation faster than fp16? Is it because the int8 engine is faster?
I would really appreciate it if someone could give me some explanations!
Also, the link you gave points to something different. What bothers me most is why we need the DQ node in the pytorch_quantization package to restore precision back to float. Thanks
Quantization refers to techniques for performing computations and storing tensors at lower bit widths than floating-point precision. A quantized model executes some or all of the operations on tensors with integers rather than floating-point values. Since only some operations run on integers, the operations around them still expect floating-point inputs, which is why the DQ node converts the tensor back to float.
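As a rough sketch of what the Q and DQ nodes do around an int8 op (the scale value here is illustrative, not from any real calibration):

```python
import numpy as np

def quantize(x, scale):
    """Q: map float values to int8 using a per-tensor scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    """DQ: map int8 values back to float so downstream float ops still work."""
    return q.astype(np.float32) * scale

x = np.array([3.0, -1.5, 0.25], dtype=np.float32)
scale = 0.02                       # illustrative scale

q = quantize(x, scale)             # integer tensor consumed by the int8 kernel
x_restored = dequantize(q, scale)  # float tensor handed to the next float op
print(q, x_restored)
```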
QAT is a quantization method used in conjunction with training, i.e. it is applied inside frameworks such as PyTorch and TensorFlow. These frameworks use a fake-quantization function during training. When you export such a model to ONNX for TRT consumption, each fake-quantization node is converted into a pair of Q/DQ (QuantizeLinear/DequantizeLinear) operators. PyTorch's quantization also supports fusing some nodes with fake-quantization, but ONNX does not support these fused nodes, so those models cannot be exported to ONNX.
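A rough sketch of that export flow with pytorch_quantization, based on its documented workflow; the tiny model, random calibration data, and omitted QAT fine-tuning are placeholders, so treat this as an outline rather than a complete recipe:

```python
import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

quant_modules.initialize()                       # patch torch.nn layers with fake-quantized versions

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()

# Collect amax statistics with a few random batches (stand-in for real calibration / QAT fine-tuning).
for _, module in model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.disable_quant()
        module.enable_calib()
with torch.no_grad():
    model(torch.randn(4, 3, 32, 32))
for _, module in model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.load_calib_amax()
        module.enable_quant()
        module.disable_calib()

# Export: each fake-quant node becomes a QuantizeLinear/DequantizeLinear pair in the ONNX graph,
# which TensorRT then consumes as explicit Q/DQ nodes.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model, torch.randn(1, 3, 32, 32), "qat_model.onnx", opset_version=13)
```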