TRT Engine in INT8 is much slower than FP16

Description

The same ONNX model converted to a TensorRT engine in INT8 runs much slower at inference than the FP16 engine (see the timings under Steps To Reproduce).

Environment

TensorRT Version: TensorRT 8.0.1
GPU Type: RTX 3070
Nvidia Driver Version: 470.63.01
CUDA Version: 11.3
CUDNN Version: 8.2.2.26
Operating System + Version: Ubuntu 20.04
Python Version (if applicable): 3.6
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.7
Baremetal or Container (if container which image + tag): Baremetal

Relevant Files

onnx_file
calibration
int8_model
fp16_model
infer.py

Steps To Reproduce

  1. Convert ONNX to TRT model in FP16 and INT8
./trtexec --onnx=model.onnx --minShapes=input0:1x1x1024x256 --optShapes=input0:1x1x1024x500 --maxShapes=input0:1x1x1024x650 --fp16 --workspace=5000 --verbose --saveEngine=model_fp16.bin
./trtexec --onnx=model.onnx --minShapes=input0:1x1x1024x256 --optShapes=input0:1x1x1024x500 --maxShapes=input0:1x1x1024x650 --int8 --calib=calibration.cache  --workspace=5000 --verbose --saveEngine=model_int8.bin
  2. Infer data with these models (a trtexec-based timing check is also sketched after these results)
    This shows that the FP16 model is much faster than the INT8 model:
Time Used for model model_fp16.bin: 6.716s
Time Used for model model_int8.bin: 15.277s
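
As a cross-check independent of infer.py, both engines can also be timed directly with trtexec at a fixed input shape (a sketch; 1x1x1024x500 matches the opt shape used at build time):
./trtexec --loadEngine=model_fp16.bin --shapes=input0:1x1x1024x500
./trtexec --loadEngine=model_int8.bin --shapes=input0:1x1x1024x500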

Please help me with this.
Thank you.

Lanny

Hi, please refer to the links below to perform inference in INT8.

Thanks!

Thank you. I checked the steps I used to convert the ONNX model to a TRT INT8 model, and they comply with the steps you provided. But the INT8 model is still much slower than the FP16 one.

Hi,

That's possible if many layers end up falling back to FP32. You'd probably want to enable both INT8 and FP16 and check.
Could you please share the trtexec --verbose logs for both the FP16 and INT8 build commands?
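
For example, a build command along these lines would let layers without INT8 implementations fall back to FP16 instead of FP32 (a sketch based on the build commands above; the output name model_int8_fp16.bin is just illustrative):
./trtexec --onnx=model.onnx --minShapes=input0:1x1x1024x256 --optShapes=input0:1x1x1024x500 --maxShapes=input0:1x1x1024x650 --int8 --fp16 --calib=calibration.cache --workspace=5000 --verbose --saveEngine=model_int8_fp16.bin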

Thank you.

Thank you. I checked the output with --verbose and found the fallback to FP32. That explains why the INT8 model is slower than the FP16 one.
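
For reference, per-layer timings can also be dumped from the saved engine to see which layers dominate (a sketch; the opt shape 1x1x1024x500 is taken from the build commands above):
./trtexec --loadEngine=model_int8.bin --shapes=input0:1x1x1024x500 --dumpProfile
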
Thank you!

Lanny
