Explicit quantization vs implicit quantization

Description

I am confused about why I cannot take the calibration table contained in the QAT ONNX model (explicit quantization) and then use TensorRT's internal quantization (implicit quantization). Can someone help me?

Environment

TensorRT Version: 7.0
GPU Type: V100

Hi,

Request you to share the model, script, profiler, and performance output if not shared already so that we can help you better.

Alternatively, you can try running your model with trtexec command.

When measuring model performance, make sure you consider the latency and throughput of the network inference alone, excluding data pre- and post-processing overhead.
Please refer to the below links for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#measure-performance

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#model-accuracy
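As an illustration, a minimal sketch of that measurement principle (warm up first, synchronize before stopping the clock, and time only the forward pass) could look like the following. The ResNet18 model and random input below are just placeholders for your own network and data.

import time
import torch
import torchvision

# Placeholder model and input; substitute your own network and real batches.
model = torchvision.models.resnet18().eval().cuda()
input_batch = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    # Warm-up so one-time costs (allocations, autotuning) are excluded.
    for _ in range(50):
        model(input_batch)
    torch.cuda.synchronize()

    # Time only the network inference, not data loading or pre/post-processing.
    iters = 1000
    start = time.perf_counter()
    for _ in range(iters):
        model(input_batch)
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start

print(f"mean latency: {elapsed / iters * 1e3:.2f} ms")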

Thanks!

Thanks for your reply!
This is the ResNet18 ONNX model (implicit quantization):
resnet18.onnx (42.6 MB)
This is the quantized ResNet18 ONNX model exported with the pytorch_quantization package (explicit quantization):
resnet18_quant.onnx (42.7 MB)
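Roughly, the export flow for a model like this with pytorch_quantization looks as follows. This is only a sketch, not my exact script: the calibration/QAT fine-tuning step is omitted and the input shape is assumed to be 1x3x224x224.

import torch
import torchvision
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

# Monkey-patch torch.nn layers with quantized counterparts before building the model.
quant_modules.initialize()
model = torchvision.models.resnet18(pretrained=True).eval().cuda()

# ... run calibration / QAT fine-tuning here so the quantizers have valid scales ...

# Export the fake-quant nodes as ONNX QuantizeLinear/DequantizeLinear (Q/DQ) pairs.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(model, dummy_input, "resnet18_quant.onnx", opset_version=13)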

This is the command I used:

trtexec --onnx=xxx.onnx --saveEngine=tmp.trt --iterations=10000 --int8

The results show that explicit quantization (mean GPU time 2.2 ms) is much slower than implicit quantization (0.9 ms).

My question is: why can't TensorRT use the calibration info in the explicitly quantized model to perform like implicit quantization, instead of having to use the Q/DQ nodes, which make it slower than implicit quantization?

In other words, why can't the PTQ model exported from pytorch_quantization perform like TensorRT's internal PTQ (plain TensorRT INT8 processing)?

And why can't we remove the Q/DQ layers from the explicitly quantized model and then use TensorRT's internal PTQ?
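To make "TensorRT's internal PTQ" concrete, this is roughly what I mean: build the engine from the plain (non-Q/DQ) ONNX model and let TensorRT calibrate it. The following is only a sketch against the TensorRT 7.x Python API; the calibrator is stubbed out and would need real calibration data and device buffers.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

class StubCalibrator(trt.IInt8EntropyCalibrator2):
    # Stub: a real calibrator must feed device pointers to calibration batches.
    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        return None  # returning None tells TensorRT there are no more batches

    def read_calibration_cache(self):
        return None

    def write_calibration_cache(self, cache):
        pass

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("resnet18.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30
config.set_flag(trt.BuilderFlag.INT8)            # enable implicit INT8 quantization
config.int8_calibrator = StubCalibrator()        # TensorRT runs its own calibration
engine = builder.build_engine(network, config)   # TensorRT 7.x builder API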

Hi,

Hope the following doc helps. It has clear details on Explicit vs Implicit Quantization.

Thank you.