I am confused about why I cannot take the calibration information contained in the QAT ONNX model (explicit quantization) and then let TensorRT do its internal quantization (implicit quantization). Can someone help me?
Thanks for your reply!
This is the ResNet-18 ONNX model (implicit quantization): resnet18.onnx (42.6 MB)
This is the quantized ResNet-18 ONNX model exported by the pytorch_quantization package (explicit quantization): resnet18_quant.onnx (42.7 MB)
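For reference, the export followed roughly the standard pytorch_quantization flow (a minimal sketch; `calib_loader` stands in for my calibration data loader):

```python
# Rough sketch of how resnet18_quant.onnx was produced with pytorch_quantization.
import torch
import torchvision
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

quant_modules.initialize()  # monkey-patch torch.nn layers with quantized versions
model = torchvision.models.resnet18(pretrained=True).cuda().eval()

# Put all quantizers into calibration mode and collect activation statistics.
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.disable_quant()
        module.enable_calib()
with torch.no_grad():
    for images, _ in calib_loader:  # placeholder calibration loader
        model(images.cuda())
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.load_calib_amax()
        module.enable_quant()
        module.disable_calib()

# Export with fake-quant ops so the ONNX graph contains Q/DQ
# (QuantizeLinear/DequantizeLinear) nodes.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(model, dummy, "resnet18_quant.onnx", opset_version=13)
```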
The results show that explicit quantization is much slower than implicit quantization: mean GPU time 2.2 ms vs. 0.9 ms.
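For context, the comparison can be reproduced with trtexec along these lines (without a calibration cache, trtexec uses placeholder scales for the implicit run, which is fine for measuring speed):

```sh
# Implicit quantization: TensorRT chooses INT8 scales and kernels itself.
trtexec --onnx=resnet18.onnx --int8
# Explicit quantization: TensorRT must honor the Q/DQ nodes in the graph.
trtexec --onnx=resnet18_quant.onnx --int8
```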
My question is: why can't TensorRT use the calibration info embedded in the explicitly quantized model to perform like implicit quantization, instead of having to honor the Q/DQ nodes, which are slower? In other words, why can't the PTQ model exported from pytorch_quantization perform like TensorRT's internal PTQ (plain TensorRT INT8 processing)?
And why can't we remove the Q/DQ layers from the explicitly quantized model and then use TensorRT's internal PTQ? See the sketch below for what I mean.
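For example, something like this untested sketch (using onnx-graphsurgeon; the pair matching may need adjustment for edge cases) would strip the Q/DQ pairs and leave a plain FP32 graph that TensorRT could then calibrate internally:

```python
# Untested sketch: bypass QuantizeLinear/DequantizeLinear pairs so the graph
# becomes plain FP32 again, then feed it to TensorRT's internal PTQ.
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("resnet18_quant.onnx"))

for dq in [n for n in graph.nodes if n.op == "DequantizeLinear"]:
    producers = dq.inputs[0].inputs  # nodes that produce the DQ input tensor
    if len(producers) == 1 and producers[0].op == "QuantizeLinear":
        q = producers[0]
        # Rewire every consumer of the DQ output to the Q node's FP32 input.
        for consumer in list(dq.outputs[0].outputs):
            for i, tensor in enumerate(consumer.inputs):
                if tensor is dq.outputs[0]:
                    consumer.inputs[i] = q.inputs[0]

graph.cleanup().toposort()  # drops the now-dead Q/DQ nodes and their scales
onnx.save(gs.export_onnx(graph), "resnet18_stripped.onnx")
```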