TRT8 - PTQ using Q/DQ nodes integrated inside the PyTorch model (Explicit) vs. PTQ using calibration with IInt8EntropyCalibrator2 (Implicit)

Description

I am wondering which of the two quantization techniques, explicit or implicit, should give better fps when both are applied to the same original PyTorch model and inference runs on the same system (GPU, CPU, OS, Python, etc.).

Is there a definitive answer when both techniques are applied to the same PyTorch model and environment, or can the result differ from one PyTorch model to another even though both run in the same environment?

I have a test case that uses both quantization techniques to generate two quantized models from the same PyTorch model. When I run inference on the same test set, the implicitly quantized model achieves higher fps than the explicitly quantized model.
The prediction accuracies of the two are almost equal, with a negligible gap.

TensorRT Version : 8.2.0.6 Python version
GPU Type : Quadro RTX 3000
Nvidia Driver Version : R471.11 (r470_96-9) / 27.20.100.8190 (5-5-2020)
CUDA Version : 11.2
CUDNN Version : 8.1.1
Operating System + Version : Windows 10
Python Version (if applicable) : NA
TensorFlow Version (if applicable) : NA
PyTorch Version (if applicable) : 1.9
Baremetal or Container (if container which image + tag) : Baremetal
ONNX IR version : 8
Opset version : 15

Regards,

Hi, Please refer to the below links to perform inference in INT8

Thanks!

Hello,
Thanks for the response.
I will try to clarify:
I have the knowledge and experience needed to correctly implement INT8 inference with TRT in C++ and Python.
Based on that experience, I prepared two TRT INT8 engines from one pretrained PyTorch model using two techniques:

  1. TRT calibration (Implicit) - calibration is run after the ONNX model has been exported from the original FP32 PyTorch model

  2. TRT Q/DQ (Explicit) - Q/DQ nodes are integrated inside the original FP32 PyTorch model before it is exported to ONNX

The purpose of the experiment was to check which technique gives better TRT engine inference accuracy and higher fps when both engines are generated from the same PyTorch model.

My results show that both techniques achieve the same inference accuracy, but the first one achieves much higher fps.

So I wonder whether this makes sense.
In general, is it true that when both techniques are applied to the same PyTorch model, the first one will always achieve higher fps?
Or is that not true, and should I now investigate to find the root cause of the fps gap?
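
Before looking for a root cause, it may help to confirm that both engines are timed in exactly the same way. Below is a minimal sketch of such a timing harness. The names `measure_fps`, `infer`, and `batches` are hypothetical and not part of TensorRT; `infer` is assumed to be a wrapper that runs one batch and returns only after the GPU work has finished, since otherwise the timing is not meaningful.

```python
import time


def measure_fps(infer, batches, warmup=10, repeats=3):
    """Return the best frames-per-second over `repeats` timed passes.

    `infer`   : hypothetical callable that runs one batch synchronously.
    `batches` : list of (batch, n_frames) pairs, the same list for both engines.
    """
    # Warm-up calls so one-time setup cost is excluded from the measurement.
    for batch, _ in batches[:warmup]:
        infer(batch)

    total_frames = sum(n for _, n in batches)
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for batch, _ in batches:
            infer(batch)
        # Guard against a zero interval when `infer` is trivially fast.
        elapsed = max(time.perf_counter() - start, 1e-9)
        best = max(best, total_frames / elapsed)
    return best
```

Calling `measure_fps` once with the implicit engine's wrapper and once with the explicit engine's wrapper, on the same `batches`, gives two numbers that can be compared directly.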

To me it makes sense that the first technique achieves higher fps: in that case the TRT engine was generated from the original model layers only, while in the second case several Q/DQ layers were added to the model, which increases the number of mathematical operations even after the model is optimized into a TRT engine.
Am I right?

Regards,


Implicit quantization optimizes for performance, and explicit quantization optimizes for performance while maintaining arithmetic precision (accuracy). Because implicit quantization does not have any accuracy optimization constraints, it will run as fast as explicit quantization (or faster). More details in the link you’ve provided above (Developer Guide :: NVIDIA Deep Learning TensorRT Documentation).

Explicit quantization must obey the arithmetic operations defined by the Q/DQ nodes in the graph. This constrains the optimizer. For example, say you’ve defined in your graph Conv2d → Q → DQ → Swish. TRT can’t fuse this into one kernel because it doesn’t have an implementation for that, so we get: QConv2d → DQ → Swish. The QConv2d will output quantized activations and the DQ will feed Swish with float input. This is exactly what the graph’s arithmetic defines.
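
To make the arithmetic of that example concrete, here is a small scalar sketch of Conv2d output → Q → DQ → Swish in plain Python. It assumes symmetric INT8 quantization with a single scale; the values of `conv_out` and `scale` are made up for illustration, and the helper names are not TensorRT APIs.

```python
import math


def quantize(x, scale):
    """Q: float -> int8 value. Divide by scale, round to nearest, saturate."""
    # Python's round() resolves ties to the nearest even integer.
    return max(-128, min(127, round(x / scale)))


def dequantize(q, scale):
    """DQ: int8 value -> float."""
    return q * scale


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


conv_out = 0.7312  # hypothetical float output of the convolution
scale = 0.05       # hypothetical activation scale

q = quantize(conv_out, scale)  # int8 activation produced by QConv2d (15 here)
x_dq = dequantize(q, scale)    # float value the DQ node feeds to Swish
y = swish(x_dq)                # Swish runs on float input, as the graph demands
```

The rounding step in `quantize` is where precision is lost, and the float input to `swish` is the part the optimizer is not free to change in explicit mode.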
In implicit quantization you define Conv2d → Swish. This fuses into a single kernel which will probably execute in int8. Probably, and not definitely, because the graph semantics don’t mandate int8 and that decision will be made by the optimizer’s auto-tuner component, based on measured performance of float vs int8.
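
The difference between the two modes can be pictured with a toy model of the precision choice for one fused layer. This is only an illustration of the idea described above, not TensorRT's actual tuner; the function name `pick_precision` and all latency numbers are invented.

```python
def pick_precision(measured_ms, forced=None):
    """Toy model of precision selection for a single fused layer.

    measured_ms : dict mapping precision name -> measured latency in ms
                  (hypothetical numbers).
    forced      : precision mandated by Q/DQ placement (explicit mode),
                  or None when the optimizer is free to choose (implicit mode).
    """
    if forced is not None:
        # Explicit mode: the graph semantics decide, whatever the timings say.
        return forced
    # Implicit mode: take whichever candidate measured fastest.
    return min(measured_ms, key=measured_ms.get)


timings = {"fp32": 0.42, "int8": 0.18}                     # made-up latencies
implicit_choice = pick_precision(timings)                  # fastest wins
explicit_choice = pick_precision(timings, forced="fp32")   # graph wins

# int8 is not guaranteed in implicit mode: if float measures faster, it is used.
other_choice = pick_precision({"fp32": 0.10, "int8": 0.14})
```

In this toy model implicit mode can never come out slower than explicit mode for the same layer, which matches the "as fast or faster" statement above.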
