8-bit quantized ONNX file and its 8-bit engine inference results differ

Description

Inference results from an 8-bit quantized ONNX model differ between ONNX Runtime and the TensorRT engine built from it (details below).

Environment

TensorRT Version: 8.2.0.6
GPU Type: RTX2080
Nvidia Driver Version: 470.63.01
CUDA Version: 11.4
CUDNN Version: 8.5
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 1.7
PyTorch Version (if applicable): 1.9

Hello,
I’m working on running inference with an 8-bit quantized model. I went through the quantization process and exported an ONNX file. ONNX Runtime inference works correctly, but inference with the TensorRT engine built from this ONNX file does not work correctly at all.
I also tried comparing ONNX Runtime and TensorRT inference with the Polygraphy Python API; however, every output comparison returns FAIL, and the statistics for one output even contain NaNs.

Here is my simple Polygraphy script:

from polygraphy.backend.common import bytes_from_path
from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import TrtRunner, engine_from_bytes
from polygraphy.comparator import Comparator

build_onnxrt_session = SessionFromOnnx('./quantized_detnet.onnx')
engine = engine_from_bytes(bytes_from_path('./quantized_detnet.eng'))

runners = [
    TrtRunner(engine),
    OnnxrtRunner(build_onnxrt_session),
]

run_results = Comparator.run(runners)
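
I then check accuracy on the collected results along these lines (a minimal sketch; the tolerance values are only illustrative):

from polygraphy.comparator import CompareFunc

# Compare each pair of runners' outputs and report PASS/FAIL per output
acc_results = Comparator.compare_accuracy(
    run_results,
    compare_func=CompareFunc.simple(atol=1e-3, rtol=1e-3),  # illustrative tolerances
)
print("Comparison passed:", bool(acc_results))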

I also noticed that the input and output binding order in the engine is swapped compared to the ONNX file (and the original model).
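
For reference, this is roughly how I checked the binding order (a sketch using the TensorRT 8.2 and onnx Python APIs; the file names are the attached ones):

import onnx
import tensorrt as trt

# List the engine bindings in order, marking inputs vs. outputs
logger = trt.Logger(trt.Logger.WARNING)
with open('./quantized_detnet.eng', 'rb') as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
for i in range(engine.num_bindings):
    kind = 'input' if engine.binding_is_input(i) else 'output'
    print(i, kind, engine.get_binding_name(i))

# Compare against the I/O order recorded in the ONNX graph
model = onnx.load('./quantized_detnet.onnx')
print('onnx inputs: ', [t.name for t in model.graph.input])
print('onnx outputs:', [t.name for t in model.graph.output])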

I attach both ONNX model and engine files produced from it.

quantized_detnet.onnx (12.7 MB)
quantized_detnet.eng (10.0 MB)

Thank you.

Hi,
Please share the ONNX model and the script, if not shared already, so that we can assist you better.
In the meantime, you can try a few things:

1. Validate your model with the below snippet:

check_model.py

import onnx

filename = "yourONNXmodel"  # replace with the path to your ONNX model
model = onnx.load(filename)
onnx.checker.check_model(model)

2. Try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
In case you are still facing the issue, please share the trtexec --verbose log for further debugging.
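For example, a typical command for an INT8 (Q/DQ) quantized ONNX model would look roughly like this (adjust the paths and precision flags for your setup):

trtexec --onnx=quantized_detnet.onnx --int8 --saveEngine=quantized_detnet.eng --verbose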
Thanks!

Hi,

Are you still facing this issue?

Thank you.