TRT8 - PTQ using integrated Q/DQ nodes inside the PyTorch model (Explicit) vs. PTQ using calibration-based IInt8EntropyCalibrator2 (Implicit)

orong13 · November 29, 2021, 4:55pm

Description

I am wondering which of the two quantization techniques, explicit or implicit, should give higher fps when both operate on the same original PyTorch model and run inference on the same system (GPU, CPU, OS, Python, etc.).

Is there an absolute answer when both techniques operate on the same PyTorch model and environment, or can the results differ from one PyTorch model to another even though both run inference in the same environment?

I have a test case that uses both quantization techniques to generate two quantized models from the same PyTorch model. When I run inference on the same test set, the implicitly quantized model achieves higher fps than the explicitly quantized model.
The prediction accuracies of the two models are almost equal, with a negligible gap.

TensorRT Version : 8.2.0.6 Python version
GPU Type : Quadro RTX 3000
Nvidia Driver Version : R471.11 (r470_96-9) / 27.20.100.8190 (5-5-2020)
CUDA Version : 11.2
CUDNN Version : 8.1.1
Operating System + Version : Windows 10
Python Version (if applicable) : NA
TensorFlow Version (if applicable) : NA
PyTorch Version (if applicable) : 1.9
Baremetal or Container (if container which image + tag) : Baremetal
ONNX IR version : 8
Opset version : 15

Regards,

NVES · November 30, 2021, 7:09am

Hi, Please refer to the below links to perform inference in INT8

Thanks!

orong13 · December 2, 2021, 5:37am

Hello,
Thanks for the response.
I will try to clarify:
I have the knowledge and experience to correctly implement INT8 inference with TRT in C++ and Python.
Based on my previous experience, I prepared two TRT INT8 engines from one pretrained PyTorch model using two techniques:

TRT calibration (Implicit) - applied after the ONNX model was exported from the original FP32 PyTorch model
TRT Q/DQ (Explicit) - nodes integrated inside the original FP32 PyTorch model before exporting it to ONNX

The purpose of the experiment was to check which technique achieves better TRT engine inference accuracy and higher fps when both engines are generated from the same PyTorch model.

My experiment results show that both techniques achieve the same inference accuracy, but the first one achieves much higher fps.

So I wonder whether this makes sense.
In the general case, is it true that when both techniques are applied to the same PyTorch model, the first one will always achieve higher fps?
Or is that not true, and I should investigate to find the root cause of the fps gap?

To me it makes sense that the first technique achieves higher fps: in that case the TRT engine file was generated from the original model layers only, while in the second case several Q/DQ layers were added to the model, which increases the number of mathematical operations even though the model was optimized into a TRT engine file.
Am I right?

Regards,

nzmora · December 13, 2021, 2:41pm

Implicit quantization optimizes for performance, and explicit quantization optimizes for performance while maintaining arithmetic precision (accuracy). Because implicit quantization does not have any accuracy optimization constraints, it will run as fast as explicit quantization (or faster). More details in the link you’ve provided above (https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#explicit-implicit-quantization).

Explicit quantization must obey the arithmetic operations defined by the Q/DQ nodes in the graph. This constrains the optimizer. For example, say you’ve defined in your graph Conv2d → Q → DQ → Swish. TRT can’t fuse this into one kernel because it doesn’t have an implementation for it, so we get: QConv2d → DQ → Swish. The QConv2d will output quantized activations and the DQ will feed Swish with float input. This is exactly what the graph arithmetic defines.
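
To make the arithmetic concrete, here is a minimal pure-Python sketch of what a Q → DQ pair followed by Swish computes. It assumes symmetric per-tensor int8 quantization with a zero-point of 0; the sample values, the scale, and the helper names (`quantize`, `dequantize`, `swish`) are made up for illustration and are not TensorRT APIs.

```python
import math

def quantize(x, scale):
    """Q node (assumed symmetric int8): scale, round to nearest, clamp to [-128, 127]."""
    q = round(x / scale)
    return max(-128, min(127, q))

def dequantize(q, scale):
    """DQ node: map the int8 value back to a float."""
    return q * scale

def swish(x):
    """Swish activation, evaluated in floating point."""
    return x / (1.0 + math.exp(-x))

# Hypothetical convolution outputs and scale, chosen only for illustration.
conv_out = [-1.3, -0.02, 0.4, 2.75]
scale = 0.05

quantized = [quantize(v, scale) for v in conv_out]        # what the quantized conv emits
dequantized = [dequantize(q, scale) for q in quantized]   # float values handed to Swish
activated = [swish(v) for v in dequantized]               # Swish runs on float input

print(quantized)
print(dequantized)
```

The point of the sketch is only that the Q/DQ pair fixes where values are rounded to int8 and where they return to float, so the optimizer has to preserve those steps rather than pick whatever is fastest.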
In implicit quantization you define Conv2d → Swish. This fuses into a single kernel which will probably execute in int8. Probably, and not definitely, because the graph semantics don’t mandate int8 and that decision will be made by the optimizer’s auto-tuner component, based on measured performance of float vs int8.
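
The "probably, and not definitely" part can be pictured with a toy model of the choice. This is only a sketch of the idea that the fastest measured candidate wins; it is not TensorRT's actual auto-tuner, and the function name `pick_kernel` and all timings are hypothetical.

```python
def pick_kernel(measured_ms):
    """Return the precision whose measured latency is lowest (toy selection rule)."""
    return min(measured_ms, key=measured_ms.get)

# Hypothetical latencies in milliseconds for a fused Conv2d + Swish kernel.
layer_a = {"float": 0.42, "int8": 0.19}   # int8 candidate is faster
layer_b = {"float": 0.21, "int8": 0.30}   # float candidate is faster

choice_a = pick_kernel(layer_a)
choice_b = pick_kernel(layer_b)

print(choice_a, choice_b)
```

With explicit Q/DQ nodes there is no such freedom for the layers the graph marks as quantized, which is one way to read the fps gap described in the question.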
