TensorRT explicit quantization layer fusion

Description

When TensorRT processes a quantized ResNet50 ONNX graph (explicit quantization), it does not perform all the layer fusions that it applies with implicit quantization. In particular, with implicit quantization the first convolution layer is fused with the following maxpool layer, which does not happen with the explicitly quantized model. This gives the implicitly quantized model about 15% higher throughput.

The TensorRT documentation does not mention the conditions needed for fusing convolution and maxpool layers. I experimented with multiple settings but was not able to force the fusion.
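For reference, the explicitly quantized model carries its quantization parameters as QuantizeLinear/DequantizeLinear node pairs in the graph itself. A quick way to inspect where those pairs sit is a minimal sketch using the onnx Python package (the filename matches the attached model):

```python
# List the Q/DQ nodes in the explicitly quantized graph to see which
# tensors carry quantization parameters (and which, such as the maxpool
# output, possibly do not).
import onnx

model = onnx.load("resnet50_fake_ptq.onnx")
for node in model.graph.node:
    if node.op_type in ("QuantizeLinear", "DequantizeLinear"):
        print(f"{node.op_type:17s} {node.name}  input={list(node.input)[:1]}")
```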

Environment

TensorRT Version: 8.2.3-1+cuda11.4
GPU Type: A100-SXM4-40GB
Nvidia Driver Version: 460.32.03
CUDA Version: 11.6
CUDNN Version: 8.3
Operating System + Version: Ubuntu 20.04.2 LTS
Python Version (if applicable): 3.8.10
TensorFlow Version (if applicable): not applicable
PyTorch Version (if applicable): not applicable
Baremetal or Container (if container which image + tag): tensorrt:22.02-py3 (NGC catalog)

Relevant Files

ONNX graphs: resnet50.onnx (FP32) and resnet50_fake_ptq.onnx (explicit quantization)
Layer profiles (generated by trtexec): resnet50_profile.json (implicit quantization) and resnet50_fake_ptq_profile.json (explicit quantization)
resnet50_fake_ptq_profile.json (10.8 KB)
resnet50_fake_ptq.onnx (97.8 MB)
resnet50_profile.json (7.6 KB)
resnet50.onnx (97.7 MB)

Steps To Reproduce

Using the Docker container listed above, I benchmark the performance with trtexec:

Implicit quantization
trtexec --onnx=resnet50.onnx --int8 --shapes=input:128x3x224x224

Explicit quantization
trtexec --onnx=resnet50_fake_ptq.onnx --int8 --shapes=input:128x3x224x224

I can inspect the layer fusions by enabling layer profiling with the flags --exportProfile and --separateProfileRun:

Implicit quantization
trtexec --onnx=resnet50.onnx --int8 --shapes=input:128x3x224x224 --exportProfile=resnet50_profile.json --separateProfileRun

Explicit quantization
trtexec --onnx=resnet50_fake_ptq.onnx --int8 --shapes=input:128x3x224x224 --exportProfile=resnet50_fake_ptq_profile.json --separateProfileRun
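
To make the missing fusion easier to spot, the two profiles can be compared side by side. A minimal sketch, assuming the usual trtexec --exportProfile layout (a JSON array whose per-layer entries carry "name" and "averageMs" fields):

```python
# Print per-layer timings from both profiles so the fused vs. unfused
# Conv/MaxPool region can be compared directly.
import json

for path in ("resnet50_profile.json", "resnet50_fake_ptq_profile.json"):
    print(f"\n=== {path} ===")
    with open(path) as f:
        entries = json.load(f)
    for entry in entries:
        if "name" in entry:  # skip the leading {"count": ...} record
            print(f"{entry['averageMs']:8.4f} ms  {entry['name']}")
```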

Thank you for sharing the issue repro model. Our team will work on this issue. Please allow us some time.

Hello. I haven’t heard back on this issue for over a month. Are there updates?

Hi, please refer to the links below to perform inference in INT8.

Thanks!

Thank you for the reply, but the answer does not address the issue I raised.

I can successfully run quantized models through TensorRT with both implicit quantization (the approach described in the documentation you shared) and explicit quantization. The issue I raised is that the quantized ResNet50 shows a performance difference of about 15% between explicit and implicit quantization. This discrepancy stems from a layer fusion (conv + max pooling) that TensorRT performs under implicit quantization but not under explicit quantization.

My question is whether there is a particular graph configuration in the explicit quantization case that will promote the layer fusion.
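
For concreteness, here is the kind of graph edit the question is about, sketched with onnx-graphsurgeon: inserting a Q/DQ pair on the maxpool output so the Conv -> MaxPool boundary carries INT8 quantization parameters. This is only an illustration, not a known fix; the tensor/node names are hypothetical and the 0.1 scale is a placeholder that would normally come from calibration.

```python
# Hypothetical experiment: add a Q/DQ pair after the first MaxPool so the
# Conv -> MaxPool boundary is expressed in INT8, mirroring what implicit
# quantization does internally.
import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("resnet50_fake_ptq.onnx"))
maxpool = next(n for n in graph.nodes if n.op == "MaxPool")
pool_out = maxpool.outputs[0]

scale = gs.Constant("pool_qdq_scale", np.array(0.1, dtype=np.float32))
zero_point = gs.Constant("pool_qdq_zp", np.array(0, dtype=np.int8))
q_out = gs.Variable("pool_q_out", dtype=np.int8)
dq_out = gs.Variable("pool_dq_out", dtype=np.float32)

q_node = gs.Node("QuantizeLinear", inputs=[pool_out, scale, zero_point], outputs=[q_out])
dq_node = gs.Node("DequantizeLinear", inputs=[q_out, scale, zero_point], outputs=[dq_out])

# Re-route every original consumer of the MaxPool output to read the
# dequantized tensor instead (the new Q node keeps reading the raw output).
for consumer in [n for n in pool_out.outputs if n is not q_node]:
    for i, tensor in enumerate(consumer.inputs):
        if tensor is pool_out:
            consumer.inputs[i] = dq_out

graph.nodes.extend([q_node, dq_node])
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "resnet50_fake_ptq_qdq_pool.onnx")
```

Rebuilding the engine from the edited model with the same trtexec command would show in the profile whether the fusion then fires.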