Description
We observed a speed improvement with TensorRT --fp16 between the pre- and post-INT8-quantization versions of the same model. What could be the underlying reason for this performance improvement?
Environment
TensorRT Version: v100100 (10.1.0, as reported by trtexec)
GPU Type: L4
Nvidia Driver Version: 550.90.07
CUDA Version: 12.5
CUDNN Version:
Operating System + Version: Ubuntu 22.04.4 LTS
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
Steps To Reproduce
Scenario 1: TensorRT --fp16, pre-quantization
- Converted the original PyTorch model -> ONNX -> onnxsim -> TRT (see the export sketch after this scenario)
- Used command: /usr/src/tensorrt/bin/trtexec --fp16 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --minShapes=input:1x12x3x224x224 --optShapes=input:4x12x3x224x224 --verbose --maxShapes=input:16x12x3x224x224 --onnx=original.onnx --memPoolSize=workspace:21034 --profilingVerbosity=detailed --exportLayerInfo=graph.json --builderOptimizationLevel=5 --saveEngine=model.plan
- Throughput: 25 QPS (Triton + Perf Analyzer)
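For reference, a minimal export sketch for this path. The model handle, file names, and opset are placeholders, and the dynamic batch axis is assumed from the min/opt/max shapes passed to trtexec:

```python
# Hypothetical export sketch; "model" and paths are placeholders, not the exact script used.
import torch
import onnx
from onnxsim import simplify

model = ...  # original PyTorch model
model.eval()

# Input layout assumed from the trtexec shapes: Bx12x3x224x224 with a dynamic batch dim
dummy = torch.randn(1, 12, 3, 224, 224)
torch.onnx.export(
    model,
    dummy,
    "original.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=17,
)

# Simplify the graph with onnxsim before handing it to trtexec
onnx_model = onnx.load("original.onnx")
simplified, ok = simplify(onnx_model)
assert ok, "onnxsim simplification check failed"
onnx.save(simplified, "original.onnx")
```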
Scenario 2: TensorRT --fp16, post-quantization (INT8) using ModelOpt
- Quantized the original PyTorch model to INT8 using modelopt.torch.quantization (see the sketch after this scenario).
- Converted torch -> ONNX -> onnxsim
- Used Command: /usr/src/tensorrt/bin/trtexec --fp16 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --minShapes=input:1x12x3x224x224 --optShapes=input:4x12x3x224x224 --verbose --maxShapes=input:16x12x3x224x224 --onnx=mtq_onnxsim.onnx --memPoolSize=workspace:21034 --profilingVerbosity=detailed --exportLayerInfo=graph.json --builderOptimizationLevel=5 --saveEngine=model.plan
- Throughput: 29 QPS (Triton + Perf Analyzer)
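A rough sketch of the quantization step, assuming ModelOpt's default INT8 config and a small calibration loader (both are assumptions, not the exact script used here):

```python
# Hypothetical ModelOpt INT8 quantization sketch; config and loader are assumptions.
import torch
import modelopt.torch.quantization as mtq

model = ...          # original PyTorch model, eval mode
calib_loader = ...   # small representative DataLoader for calibration

def forward_loop(m):
    # Run a few calibration batches so ModelOpt can collect activation ranges
    with torch.no_grad():
        for batch in calib_loader:
            m(batch)

# Inserts Q/DQ nodes and calibrates; the quantized model is then exported to ONNX
# (torch -> ONNX -> onnxsim) the same way as in Scenario 1.
quant_model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop=forward_loop)
```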
There is an approximately 15% increase in throughput from the pre-quantization to the post-quantization engine.
Our understanding is that the --fp16 flag will not apply any INT8 optimizations, so we were not expecting an improvement in the post-quantization FP16 build. But the post-quantization --fp16 engine is faster; what could be the reason? Thanks.
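Since both builds export detailed layer info (--exportLayerInfo=graph.json), one way to investigate is to compare which precisions the builder actually chose in each engine. A rough inspection sketch, assuming the JSON has a "Layers" array with a per-layer "Precision" field (the exact schema varies by TensorRT version, so treat the field names as assumptions):

```python
# Rough diagnostic sketch: count the precisions recorded in the exported layer info.
import json
from collections import Counter

with open("graph.json") as f:
    graph = json.load(f)

# Layer entries are dicts when --profilingVerbosity=detailed is used
layers = graph.get("Layers", []) if isinstance(graph, dict) else graph

precisions = Counter()
for layer in layers:
    if isinstance(layer, dict):
        precisions[layer.get("Precision", "unknown")] += 1

print(precisions)  # e.g. counts of FP16 / INT8 / FP32 layers in the built engine
```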