8-bit quantized ONNX file and its 8-bit engine produce different inference results




TensorRT Version:
GPU Type: RTX2080
Nvidia Driver Version: 470.63.01
CUDA Version: 11.4
CUDNN Version: 8.5
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 1.7
PyTorch Version (if applicable): 1.9

I’m working on running inference with an 8-bit quantized model. I went through the quantization process and exported an ONNX file. ONNX Runtime inference works correctly, while inference with the TensorRT engine built from this ONNX file does not work correctly at all.
I also tried to compare the ONNX and TensorRT inference with the Polygraphy Python API; however, the comparison fails for every output, and the statistics for one output even contain NaNs.

Here is my simple script using the Polygraphy Python API:

from polygraphy.backend.common import bytes_from_path
from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import TrtRunner, engine_from_bytes
from polygraphy.comparator import Comparator

build_onnxrt_session = SessionFromOnnx('./quantized_detnet.onnx')
engine = engine_from_bytes(bytes_from_path('./quantized_detnet.eng'))

runners = [OnnxrtRunner(build_onnxrt_session), TrtRunner(engine)]
run_results = Comparator.run(runners)

I also noticed that the input and output binding positions in the engine were swapped compared to the ONNX file (and the original model).
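Because binding order can differ between the ONNX graph and the engine, comparing outputs positionally can report spurious FAILs; matching outputs by tensor name avoids this. A minimal sketch with synthetic NumPy outputs (the tensor names `boxes`/`scores` and the tolerance are illustrative, not from the actual model):

```python
import numpy as np

# Hypothetical output dicts keyed by tensor name; note the order differs.
onnx_outputs = {"boxes": np.array([1.0, 2.0]), "scores": np.array([0.9, 0.1])}
trt_outputs = {"scores": np.array([0.905, 0.1]), "boxes": np.array([1.0, 2.0])}

# Compare by name, not by position, with a loose tolerance for INT8 rounding.
for name, ref in onnx_outputs.items():
    ok = np.allclose(ref, trt_outputs[name], rtol=0.0, atol=0.05)
    print(name, "PASS" if ok else "FAIL")
```

A positional comparison of the same two result sets would pair `boxes` against `scores` and fail even on a correct engine.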

I attach both ONNX model and engine files produced from it.

quantized_detnet.onnx (12.7 MB)
quantized_detnet.eng (10.0 MB)

Thank you.

We request you to share the ONNX model and the script, if not shared already, so that we can assist you better.
Alongside, you can try a few things:

  1. Validate your model with the below snippet:


import onnx

filename = yourONNXmodel
model = onnx.load(filename)
onnx.checker.check_model(model)
  2. Try running your model with the trtexec command.
In case you are still facing the issue, we request you to share the trtexec --verbose log for further debugging.
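A possible invocation for this case (the --int8 flag assumes the ONNX already contains Q/DQ quantization nodes; the file name matches the attachment above):

```shell
# Build the engine with verbose logging and capture the log for sharing.
trtexec --onnx=quantized_detnet.onnx --int8 --verbose 2>&1 | tee trtexec_verbose.log
```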


Are you still facing this issue?

Thank you.