TensorRT --fp16 pre and post Int8 quantization

Description

We observed a speed improvement with TensorRT --fp16 when going from the pre-quantization model to the INT8 post-quantization model. What could be the underlying reason for this performance improvement?

Environment

TensorRT Version: v100100 (10.1.0)
GPU Type: L4
Nvidia Driver Version: 550.90.07
CUDA Version: 12.5
CUDNN Version:
Operating System + Version: Ubuntu 22.04.4 LTS
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Steps To Reproduce

Scenario 1: TensorRT --fp16, pre-quantization

  • Converted the original model: PyTorch → ONNX → onnxsim → TensorRT (export sketch below).
  • Used command: /usr/src/tensorrt/bin/trtexec --fp16 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --minShapes=input:1x12x3x224x224 --optShapes=input:4x12x3x224x224 --verbose --maxShapes=input:16x12x3x224x224 --onnx=original.onnx --memPoolSize=workspace:21034 --profilingVerbosity=detailed --exportLayerInfo=graph.json --builderOptimizationLevel=5 --saveEngine=model.plan
  • Throughput: 25 qps (Triton + perf_analyzer)
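
For reference, a minimal sketch of the Scenario 1 export path. The model class, opset version, and placeholder forward pass are assumptions on our side; only the input name ("input") and shapes are taken from the trtexec command above.

```python
import torch
import onnx
from onnxsim import simplify

class PlaceholderModel(torch.nn.Module):
    """Stands in for the original model; it only needs to accept the
    (batch, 12, 3, 224, 224) input implied by the trtexec shape flags."""
    def forward(self, x):
        b = x.shape[0]
        return x.reshape(b, -1).mean(dim=1, keepdim=True)

model = PlaceholderModel().eval()
dummy = torch.randn(1, 12, 3, 224, 224)        # --minShapes=input:1x12x3x224x224

torch.onnx.export(
    model, dummy, "original.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},      # batch varies from 1 to 16 in trtexec
    opset_version=17,                          # assumed; any recent opset works here
)

# Simplify the exported graph before handing it to trtexec.
simplified, ok = simplify(onnx.load("original.onnx"))
assert ok, "onnx-simplifier could not validate the simplified graph"
onnx.save(simplified, "original.onnx")
```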

Scenario 2: TensorRT --fp16, post-quantization (INT8) using ModelOpt

  • Quantized the original PyTorch model to INT8 using modelopt.torch.quantization.
  • Converted the quantized model: PyTorch → ONNX → onnxsim (quantization sketch below).
  • Used Command: /usr/src/tensorrt/bin/trtexec --fp16 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --minShapes=input:1x12x3x224x224 --optShapes=input:4x12x3x224x224 --verbose --maxShapes=input:16x12x3x224x224 --onnx=mtq_onnxsim.onnx --memPoolSize=workspace:21034 --profilingVerbosity=detailed --exportLayerInfo=graph.json --builderOptimizationLevel=5 --saveEngine=model.plan
  • Throughput: 29 qps (Triton + perf_analyzer)
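
And a sketch of the Scenario 2 quantize-then-export path. The calibration data, the config choice (mtq.INT8_DEFAULT_CFG), and the placeholder model (reused from the Scenario 1 sketch) are assumptions; the entry point is modelopt.torch.quantization.quantize with a calibration forward loop.

```python
import torch
import onnx
from onnxsim import simplify
import modelopt.torch.quantization as mtq

model = PlaceholderModel().eval()              # same placeholder as in the Scenario 1 sketch

# A handful of random batches standing in for real calibration data.
calib_batches = [torch.randn(4, 12, 3, 224, 224) for _ in range(8)]

def forward_loop(m):
    # ModelOpt runs this to collect activation ranges for INT8 calibration.
    with torch.no_grad():
        for batch in calib_batches:
            m(batch)

quantized = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# Export the quantized model; the inserted quantizers become Q/DQ nodes in ONNX.
dummy = torch.randn(1, 12, 3, 224, 224)
torch.onnx.export(
    quantized, dummy, "mtq.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},
    opset_version=17,
)

simplified, ok = simplify(onnx.load("mtq.onnx"))
assert ok, "onnx-simplifier could not validate the simplified graph"
onnx.save(simplified, "mtq_onnxsim.onnx")
```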

There is an approximately 15% increase in throughput from the pre-quantization to the post-quantization build (25 → 29 qps).

Our understanding is that the --fp16 flag does not enable any INT8 optimizations, so we were not expecting an improvement for the post-quantization FP16 build. Yet the post-quantization --fp16 engine is faster; what could be the reason? Thanks.
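
For what it's worth, a quick way to check whether the graph handed to trtexec still carries explicit quantization is to count the QuantizeLinear/DequantizeLinear nodes in the simplified ONNX (sketch below; the file name comes from the Scenario 2 command). If these are present, the network built under --fp16 is not a plain FP16 graph.

```python
import onnx
from collections import Counter

graph = onnx.load("mtq_onnxsim.onnx").graph
op_counts = Counter(node.op_type for node in graph.node)

# Non-zero counts mean the exported model is explicitly quantized,
# regardless of which precision flags are later passed to trtexec.
print("QuantizeLinear:  ", op_counts.get("QuantizeLinear", 0))
print("DequantizeLinear:", op_counts.get("DequantizeLinear", 0))
```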


We are also observing this issue.