Int8 performance is less than fp16

DML · August 30, 2022, 5:00pm

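Description: With BERT-large on an A100, the INT8 engine is slower than the FP16 engine: roughly half the throughput and twice the latency (see the table below).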

NVIDIA Release: 22.05
NVIDIA TensorRT Version: 8.2.5
NVIDIA Driver Version: 515.47.03
CUDA Version: 11.7
NVIDIA GPU: NVIDIA A100-PCIE-40GB
Docker Image: nvcr.io/nvidia/tensorrt:22.05-py3

Model location: bert_pyt_onnx_large_qa_squad11_amp | NVIDIA NGC
(Download command: wget 'https://api.ngc.nvidia.com/v2/models/nvidia/bert_pyt_onnx_large_qa_squad11_amp/versions/1/files/bert_large_v1_1.onnx')

Command: trtexec --useCudaGraph --loadEngine=<ENGINE_NAME>.trt --shapes=segment_ids:1x384,input_mask:1x384,input_ids:1x384 --duration=300 --verbose <PRECISION_FLAG>

Model Name	Seq length	Precision	Batch Size	Throughput (qps)	E2E Latency (ms)	GPU Latency (ms)
bert_large_v1_1	384	INT8	1	178.08	11.0478	5.6124
bert_large_v1_1	384	FP16	1	340.16	5.7558	2.9348
bert_large_v1_1	384	best	1	339.91	5.7650	2.9370
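
To put a number on the gap, here is a minimal sketch that computes ratios from the table above. The figures are copied from the table; the dictionary layout and the helper name `relative_to` are only illustrative.

```python
# Figures copied from the table above (trtexec, batch size 1, sequence length 384).
results = {
    "INT8": {"throughput": 178.08, "gpu_latency_ms": 5.6124},
    "FP16": {"throughput": 340.16, "gpu_latency_ms": 2.9348},
    "best": {"throughput": 339.91, "gpu_latency_ms": 2.9370},
}


def relative_to(baseline, metric):
    """Return each precision's metric divided by the baseline precision's metric."""
    base = results[baseline][metric]
    return {name: row[metric] / base for name, row in results.items()}


# Throughput expressed as a multiple of the INT8 throughput.
throughput_vs_int8 = relative_to("INT8", "throughput")
# GPU latency expressed as a multiple of the FP16 GPU latency.
latency_vs_fp16 = relative_to("FP16", "gpu_latency_ms")

for name in results:
    print(
        f"{name}: {throughput_vs_int8[name]:.2f}x INT8 throughput, "
        f"{latency_vs_fp16[name]:.2f}x FP16 GPU latency"
    )
```

By this measure FP16 delivers about 1.9 times the INT8 throughput, and the "best" engine is essentially identical to FP16.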

NVES · August 30, 2022, 5:37pm

Hi,
Please share the ONNX model and the script, if you have not already, so that we can assist you better.
In the meantime, you can try a few things:

1) Validate your model with the snippet below.

check_model.py

import onnx

filename = "yourONNXmodel"  # placeholder: path to your ONNX model
model = onnx.load(filename)
# Raises an exception if the model is invalid
onnx.checker.check_model(model)

2) Try running your model with the trtexec command.

If you are still facing the issue, please share the trtexec "--verbose" log for further debugging.
Thanks!

DML · August 31, 2022, 3:33pm

I checked the model using check_model.py. It reports no errors in the ONNX model.

These are the commands I used to collect the inference data:

For INT8 precision:
trtexec --useCudaGraph --loadEngine=bert_large_v1_1-int8.trt --int8 --shapes=segment_ids:1x384,input_mask:1x384,input_ids:1x384 --duration=300 --verbose
For FP16 precision:
trtexec --useCudaGraph --loadEngine=bert_large_v1_1-fp16.trt --fp16 --shapes=segment_ids:1x384,input_mask:1x384,input_ids:1x384 --duration=300 --verbose
For BEST precision:
trtexec --useCudaGraph --loadEngine=bert_large_v1_1-best.trt --best --shapes=segment_ids:1x384,input_mask:1x384,input_ids:1x384 --duration=300 --verbose
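
The three commands differ only in the engine file and the precision flag, so they can be generated from one template. The sketch below is just that: it uses only flags that appear in the commands above, and the helper name `build_trtexec_command` is hypothetical. It prints the commands rather than running them.

```python
import shlex

# Input shapes taken verbatim from the commands above.
SHAPES = "segment_ids:1x384,input_mask:1x384,input_ids:1x384"


def build_trtexec_command(precision, duration_s=300):
    """Assemble the trtexec argument list for one precision (int8, fp16 or best)."""
    # Engine file naming follows the pattern used in this thread.
    engine = f"bert_large_v1_1-{precision}.trt"
    return [
        "trtexec",
        "--useCudaGraph",
        f"--loadEngine={engine}",
        f"--{precision}",
        f"--shapes={SHAPES}",
        f"--duration={duration_s}",
        "--verbose",
    ]


commands = {p: build_trtexec_command(p) for p in ("int8", "fp16", "best")}
for precision, cmd in commands.items():
    # shlex.join renders the list as a copy-pasteable shell command.
    print(shlex.join(cmd))
```

Keeping the argument list in one place makes it easier to confirm that all three runs use the same shapes and duration.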

spolisetty · September 2, 2022, 3:30pm

Hi,

We do not support INT8-PTQ for the ONNX-BERT path yet.
To use ONNX-BERT with INT8, please use the QAT path (by explicitly inserting Q/DQ nodes).
Please refer,

Thank you.
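
A quick way to see whether an ONNX file already follows the Q/DQ path described above is to count its QuantizeLinear and DequantizeLinear nodes. This is a minimal sketch, not an official check: the helper names are hypothetical, and it only inspects the top-level graph, not nested subgraphs.

```python
from collections import Counter

# Standard ONNX operator names for quantize/dequantize (Q/DQ) nodes.
QDQ_OPS = ("QuantizeLinear", "DequantizeLinear")


def count_qdq(op_types):
    """Count Q/DQ operators in an iterable of ONNX op_type strings."""
    counts = Counter(op_types)
    return {op: counts.get(op, 0) for op in QDQ_OPS}


def count_qdq_in_file(path):
    """Load an ONNX model and count Q/DQ nodes in its top-level graph."""
    import onnx  # imported here so count_qdq works without onnx installed

    model = onnx.load(path)
    return count_qdq(node.op_type for node in model.graph.node)
```

For example, `count_qdq_in_file("bert_large_v1_1.onnx")` returning zero for both operators would indicate a model with no explicit Q/DQ nodes, that is, one not exported through a QAT workflow.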
