TensorRT --fp16 pre and post Int8 quantization

Description

We observed a speed improvement with TensorRT --fp16 when going from the pre-quantization model to the INT8 post-quantization model. What could be the underlying reason for this performance improvement?

Environment

TensorRT Version: v100100 (10.1.0)
GPU Type: L4
Nvidia Driver Version: 550.90.07
CUDA Version: 12.5
CUDNN Version:
Operating System + Version: Ubuntu 22.04.4 LTS
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Steps To Reproduce

Scenario 1: TensorRT --fp16, pre-quantization

  • Converted the original model: PyTorch → ONNX → onnxsim → TensorRT (export sketch below).
  • Used command: /usr/src/tensorrt/bin/trtexec --fp16 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --minShapes=input:1x12x3x224x224 --optShapes=input:4x12x3x224x224 --verbose --maxShapes=input:16x12x3x224x224 --onnx=original.onnx --memPoolSize=workspace:21034 --profilingVerbosity=detailed --exportLayerInfo=graph.json --builderOptimizationLevel=5 --saveEngine=model.plan
  • Throughput: 25 qps (Triton + perf_analyzer)
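
For reference, a minimal sketch of the Scenario 1 export path. The model class, opset version, and placeholder forward pass are assumptions on our side; only the input name ("input") and shapes are taken from the trtexec command above.

```python
import torch
import onnx
from onnxsim import simplify

class PlaceholderModel(torch.nn.Module):
    """Stands in for the original model; it only needs to accept the
    (batch, 12, 3, 224, 224) input implied by the trtexec shape flags."""
    def forward(self, x):
        b = x.shape[0]
        return x.reshape(b, -1).mean(dim=1, keepdim=True)

model = PlaceholderModel().eval()
dummy = torch.randn(1, 12, 3, 224, 224)        # --minShapes=input:1x12x3x224x224

torch.onnx.export(
    model, dummy, "original.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},      # batch varies from 1 to 16 in trtexec
    opset_version=17,                          # assumed; any recent opset works here
)

# Simplify the exported graph before handing it to trtexec.
simplified, ok = simplify(onnx.load("original.onnx"))
assert ok, "onnx-simplifier could not validate the simplified graph"
onnx.save(simplified, "original.onnx")
```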

Scenario 2: TensorRT --fp16, post-quantization (INT8) using ModelOpt

  • Quantized the original PyTorch model to INT8 using modelopt.torch.quantization.
  • Converted the quantized model: PyTorch → ONNX → onnxsim (quantization sketch below).
  • Used Command: /usr/src/tensorrt/bin/trtexec --fp16 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --minShapes=input:1x12x3x224x224 --optShapes=input:4x12x3x224x224 --verbose --maxShapes=input:16x12x3x224x224 --onnx=mtq_onnxsim.onnx --memPoolSize=workspace:21034 --profilingVerbosity=detailed --exportLayerInfo=graph.json --builderOptimizationLevel=5 --saveEngine=model.plan
  • Throughput: 29 qps (Triton + perf_analyzer)
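
And a sketch of the Scenario 2 quantize-then-export path. The calibration data, the config choice (mtq.INT8_DEFAULT_CFG), and the placeholder model (reused from the Scenario 1 sketch) are assumptions; the entry point is modelopt.torch.quantization.quantize with a calibration forward loop.

```python
import torch
import onnx
from onnxsim import simplify
import modelopt.torch.quantization as mtq

model = PlaceholderModel().eval()              # same placeholder as in the Scenario 1 sketch

# A handful of random batches standing in for real calibration data.
calib_batches = [torch.randn(4, 12, 3, 224, 224) for _ in range(8)]

def forward_loop(m):
    # ModelOpt runs this to collect activation ranges for INT8 calibration.
    with torch.no_grad():
        for batch in calib_batches:
            m(batch)

quantized = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# Export the quantized model; the inserted quantizers become Q/DQ nodes in ONNX.
dummy = torch.randn(1, 12, 3, 224, 224)
torch.onnx.export(
    quantized, dummy, "mtq.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},
    opset_version=17,
)

simplified, ok = simplify(onnx.load("mtq.onnx"))
assert ok, "onnx-simplifier could not validate the simplified graph"
onnx.save(simplified, "mtq_onnxsim.onnx")
```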

There is an approximately 15% increase in throughput from the pre-quantization to the post-quantization build (25 → 29 qps).

Our understanding is that the --fp16 flag does not enable any INT8 optimizations, so we were not expecting an improvement for the post-quantization FP16 build. Yet the post-quantization --fp16 engine is faster; what could be the reason? Thanks.
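
For what it's worth, a quick way to check whether the graph handed to trtexec still carries explicit quantization is to count the QuantizeLinear/DequantizeLinear nodes in the simplified ONNX (sketch below; the file name comes from the Scenario 2 command). If these are present, the network built under --fp16 is not a plain FP16 graph.

```python
import onnx
from collections import Counter

graph = onnx.load("mtq_onnxsim.onnx").graph
op_counts = Counter(node.op_type for node in graph.node)

# Non-zero counts mean the exported model is explicitly quantized,
# regardless of which precision flags are later passed to trtexec.
print("QuantizeLinear:  ", op_counts.get("QuantizeLinear", 0))
print("DequantizeLinear:", op_counts.get("DequantizeLinear", 0))
```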


We are also observing this issue.