INT8 performance is lower than FP16

Description: INT8 inference is slower than FP16 for the model below.

NVIDIA Release: 22.05
NVIDIA TensorRT Version: 8.2.5
NVIDIA Driver Version: 515.47.03
CUDA Version: 11.7
Docker Image:

Model location: bert_pyt_onnx_large_qa_squad11_amp | NVIDIA NGC
(Download command: wget ‘’)

Command: trtexec --useCudaGraph --loadEngine=<ENGINE_NAME>.trt --shapes=segment_ids:1x384,input_mask:1x384,input_ids:1x384 --duration=300 --verbose <PRECISION_FLAG>

Model Name       Seq Length  Precision  Batch Size  Throughput (qps)  E2E Latency (ms)  GPU Latency (ms)
bert_large_v1_1  384         INT8       1           178.08            11.0478           5.6124
bert_large_v1_1  384         FP16       1           340.16            5.7558            2.9348
bert_large_v1_1  384         best       1           339.91            5.7650            2.9370
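
To put a number on the gap, the figures in the table above can be compared directly. This is only a minimal Python sketch: the dictionary layout and the `throughput_ratio` helper are illustrative names, and it assumes the throughput column is in queries per second.

```python
# Measurements copied from the table above (batch size 1, sequence length 384).
# Assumption: throughput is in queries per second.
results = {
    "INT8": {"throughput_qps": 178.08, "gpu_latency_ms": 5.6124},
    "FP16": {"throughput_qps": 340.16, "gpu_latency_ms": 2.9348},
    "best": {"throughput_qps": 339.91, "gpu_latency_ms": 2.9370},
}


def throughput_ratio(baseline: str, candidate: str) -> float:
    """Return baseline throughput divided by candidate throughput."""
    return results[baseline]["throughput_qps"] / results[candidate]["throughput_qps"]


if __name__ == "__main__":
    # A ratio above 1 means the candidate is slower than the baseline.
    print(f"FP16 / INT8 throughput: {throughput_ratio('FP16', 'INT8'):.2f}x")
    print(f"FP16 / best throughput: {throughput_ratio('FP16', 'best'):.3f}x")
```

With these numbers, FP16 throughput is roughly 1.9 times the INT8 throughput, while the best result is nearly identical to FP16.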

Please share the ONNX model and the script, if you have not already, so that we can assist you better.
In the meantime, you can try a few things:

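  1. Validate your model with the snippet below:

import onnx

filename = "yourONNXmodel.onnx"  # placeholder: path to your ONNX model
model = onnx.load(filename)
onnx.checker.check_model(model)  # raises an exception if the model is invalid

  2. Try running your model with the trtexec command.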

If you are still facing issues, please share the trtexec "--verbose" log for further debugging.

I checked the model and don’t see any errors in the ONNX model.

These are the commands I used to collect the inference data:

  1. For INT8 precision:
    trtexec --useCudaGraph --loadEngine=bert_large_v1_1-int8.trt --int8 --shapes=segment_ids:1x384,input_mask:1x384,input_ids:1x384 --duration=300 --verbose

  2. For FP16 precision:
    trtexec --useCudaGraph --loadEngine=bert_large_v1_1-fp16.trt --fp16 --shapes=segment_ids:1x384,input_mask:1x384,input_ids:1x384 --duration=300 --verbose

  3. For BEST precision:
    trtexec --useCudaGraph --loadEngine=bert_large_v1_1-best.trt --best --shapes=segment_ids:1x384,input_mask:1x384,input_ids:1x384 --duration=300 --verbose
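
Since the three commands differ only in the engine file and the precision flag, they could be generated from one template. The following Python sketch shows one way to do that. The helper names `build_command` and `parse_throughput` are hypothetical, and the regular expression assumes the trtexec summary contains a line of the form "Throughput: <number> qps", which may differ between TensorRT versions.

```python
import re
import shlex
from typing import List, Optional

# Input shapes shared by all three runs (taken from the commands above).
SHAPES = "segment_ids:1x384,input_mask:1x384,input_ids:1x384"

# Assumption: the trtexec summary has a line like "Throughput: 340.16 qps".
THROUGHPUT_RE = re.compile(r"Throughput:\s*([0-9.]+)\s*qps")


def build_command(precision: str) -> List[str]:
    """Build the trtexec argument list for one precision: int8, fp16, or best."""
    return [
        "trtexec",
        "--useCudaGraph",
        f"--loadEngine=bert_large_v1_1-{precision}.trt",
        f"--{precision}",
        f"--shapes={SHAPES}",
        "--duration=300",
        "--verbose",
    ]


def parse_throughput(log_text: str) -> Optional[float]:
    """Return the first throughput value found in a log, or None if absent."""
    match = THROUGHPUT_RE.search(log_text)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    # Print the commands rather than running them, so this works without a GPU.
    for precision in ("int8", "fp16", "best"):
        print(shlex.join(build_command(precision)))
```

To actually run a benchmark, each list could be passed to `subprocess.run(..., capture_output=True, text=True)` and the captured output handed to `parse_throughput`.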


We do not support INT8-PTQ for the ONNX-BERT path yet.
To use ONNX-BERT with INT8, please use the QAT path (by explicitly inserting Q/DQ nodes).
Please refer,

Thank you.
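
Before building an INT8 engine on the QAT path, it may help to confirm that the exported ONNX model really contains Q/DQ nodes. The sketch below counts the standard ONNX operators QuantizeLinear and DequantizeLinear. The helper names are hypothetical, the `onnx` package is assumed to be installed, and only the top-level graph is inspected (nodes inside subgraphs are not counted).

```python
import sys
from collections import Counter
from typing import Iterable, List

# Standard ONNX operator names for quantize/dequantize (Q/DQ) nodes.
QDQ_OPS = ("QuantizeLinear", "DequantizeLinear")


def count_qdq(op_types: Iterable[str]) -> Counter:
    """Count how many Q and DQ nodes appear in a sequence of op type names."""
    return Counter(op for op in op_types if op in QDQ_OPS)


def op_types_from_onnx(path: str) -> List[str]:
    """Load an ONNX file and list the op types of its top-level graph nodes."""
    import onnx  # third-party package, imported lazily so count_qdq works without it

    model = onnx.load(path)
    return [node.op_type for node in model.graph.node]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_qdq.py <model.onnx>
    counts = count_qdq(op_types_from_onnx(sys.argv[1]))
    if counts:
        print(f"Found Q/DQ nodes: {dict(counts)}")
    else:
        print("No Q/DQ nodes found; the model has not been explicitly quantized.")
```

A model with no Q/DQ nodes would rely on post-training calibration for INT8, which is the path described above as unsupported for ONNX-BERT.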