TensorRT explicit quantization layer fusion

Description

When TensorRT processes a quantized ResNet50 ONNX graph (explicit quantization), it does not perform all the layer fusions that it applies with implicit quantization. In particular, with implicit quantization the first convolution layer is fused with the following maxpool layer, which does not happen with the explicitly quantized model. This gives the implicitly quantized model about 15% higher throughput.

The TensorRT documentation does not mention the conditions needed for fusing convolution and maxpool layers. I experimented with multiple settings but was not able to force the fusion.
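For reference, the explicitly quantized model carries its quantization parameters as QuantizeLinear/DequantizeLinear node pairs in the graph itself. A quick way to inspect where those pairs sit is a minimal sketch using the onnx Python package (the filename matches the attached model):

```python
# List the Q/DQ nodes in the explicitly quantized graph to see which
# tensors carry quantization parameters (and which, such as the maxpool
# output, possibly do not).
import onnx

model = onnx.load("resnet50_fake_ptq.onnx")
for node in model.graph.node:
    if node.op_type in ("QuantizeLinear", "DequantizeLinear"):
        print(f"{node.op_type:17s} {node.name}  input={list(node.input)[:1]}")
```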

Environment

TensorRT Version: 8.2.3-1+cuda11.4
GPU Type: A100-SXM4-40GB
Nvidia Driver Version: 460.32.03
CUDA Version: 11.6
CUDNN Version: 8.3
Operating System + Version: Ubuntu 20.04.2 LTS
Python Version (if applicable): 3.8.10
TensorFlow Version (if applicable): not applicable
PyTorch Version (if applicable): not applicable
Baremetal or Container (if container which image + tag): tensorrt:22.02-py3 (NGC catalog)

Relevant Files

ONNX graphs: resnet50.onnx (FP32) and resnet50_fake_ptq.onnx (explicit quantization)
Layer profiles (generated by trtexec): resnet50_profile.json (implicit quantization) and resnet50_fake_ptq_profile.json (explicit quantization)
resnet50_fake_ptq_profile.json (10.8 KB)
resnet50_fake_ptq.onnx (97.8 MB)
resnet50_profile.json (7.6 KB)
resnet50.onnx (97.7 MB)

Steps To Reproduce

Using the Docker container listed above, I benchmark the performance with trtexec:

Implicit quantization
trtexec --onnx=resnet50.onnx --int8 --shapes=input:128x3x224x224

Explicit quantization
trtexec --onnx=resnet50_fake_ptq.onnx --int8 --shapes=input:128x3x224x224

I can inspect the layer fusions by enabling layer profiling with the flags --exportProfile and --separateProfileRun:

Implicit quantization
trtexec --onnx=resnet50.onnx --int8 --shapes=input:128x3x224x224 --exportProfile=resnet50_profile.json --separateProfileRun

Explicit quantization
trtexec --onnx=resnet50_fake_ptq.onnx --int8 --shapes=input:128x3x224x224 --exportProfile=resnet50_fake_ptq_profile.json --separateProfileRun
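
To make the missing fusion easier to spot, the two profiles can be compared side by side. A minimal sketch, assuming the usual trtexec --exportProfile layout (a JSON array whose per-layer entries carry "name" and "averageMs" fields):

```python
# Print per-layer timings from both profiles so the fused vs. unfused
# Conv/MaxPool region can be compared directly.
import json

for path in ("resnet50_profile.json", "resnet50_fake_ptq_profile.json"):
    print(f"\n=== {path} ===")
    with open(path) as f:
        entries = json.load(f)
    for entry in entries:
        if "name" in entry:  # skip the leading {"count": ...} record
            print(f"{entry['averageMs']:8.4f} ms  {entry['name']}")
```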

Thank you for sharing the issue repro model. Our team will work on this issue. Please allow us some time.

Hello. I haven’t heard back on this issue for over a month. Are there updates?

Hi, please refer to the links below to perform inference in INT8.

Thanks!

Thank you for the reply, but the answer does not address the issue I raised.

I can successfully run quantized models through TensorRT with both implicit quantization (the approach described in the documentation you shared) and explicit quantization. The issue I raised is that the quantized ResNet50 shows a performance difference of about 15% between explicit and implicit quantization. This discrepancy stems from a layer fusion (conv + max pooling) that TensorRT performs under implicit quantization but not under explicit quantization.

My question is whether there is a particular graph configuration in the explicit quantization case that will promote the layer fusion.
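
For concreteness, here is the kind of graph edit the question is about, sketched with onnx-graphsurgeon: inserting a Q/DQ pair on the maxpool output so the Conv -> MaxPool boundary carries INT8 quantization parameters. This is only an illustration, not a known fix; the tensor/node names are hypothetical and the 0.1 scale is a placeholder that would normally come from calibration.

```python
# Hypothetical experiment: add a Q/DQ pair after the first MaxPool so the
# Conv -> MaxPool boundary is expressed in INT8, mirroring what implicit
# quantization does internally.
import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("resnet50_fake_ptq.onnx"))
maxpool = next(n for n in graph.nodes if n.op == "MaxPool")
pool_out = maxpool.outputs[0]

scale = gs.Constant("pool_qdq_scale", np.array(0.1, dtype=np.float32))
zero_point = gs.Constant("pool_qdq_zp", np.array(0, dtype=np.int8))
q_out = gs.Variable("pool_q_out", dtype=np.int8)
dq_out = gs.Variable("pool_dq_out", dtype=np.float32)

q_node = gs.Node("QuantizeLinear", inputs=[pool_out, scale, zero_point], outputs=[q_out])
dq_node = gs.Node("DequantizeLinear", inputs=[q_out, scale, zero_point], outputs=[dq_out])

# Re-route every original consumer of the MaxPool output to read the
# dequantized tensor instead (the new Q node keeps reading the raw output).
for consumer in [n for n in pool_out.outputs if n is not q_node]:
    for i, tensor in enumerate(consumer.inputs):
        if tensor is pool_out:
            consumer.inputs[i] = dq_out

graph.nodes.extend([q_node, dq_node])
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "resnet50_fake_ptq_qdq_pool.onnx")
```

Rebuilding the engine from the edited model with the same trtexec command would show in the profile whether the fusion then fires.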