QAT int8 TRT engine slower than fp16

Description

I have followed several tutorials to perform QAT on an EfficientNet model with PyTorch. The implementation I started from does not natively support QAT, but by slightly modifying the Conv2dStaticSamePadding class I was able to make it work with the pytorch_quantization library.
Following this example and this documentation, I finally managed to produce an INT8 quantized model that performs as well as its FP16 version.

The point of my post is that I can't understand why this INT8 model is slower than the FP16 version. I ran a trtexec benchmark of both engines on my AGX; these are the results:

FP16, batch size 32, EfficientNet-B0, 32x3x100x100: 9.8 ms
INT8, batch size 32, EfficientNet-B0, 32x3x100x100: 18 ms
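
For completeness, here is a rough sketch of the kind of trtexec invocations used for this benchmark; the ONNX file names and the workspace/avgRuns values are placeholders, not my exact command lines.

# Rough sketch of the two trtexec runs (trtexec lives under
# /usr/src/tensorrt/bin on Jetson; file names and the workspace/avgRuns
# values are placeholders).
import subprocess

def run_trtexec(onnx_path, precision_flags):
    cmd = ["trtexec", "--onnx=" + onnx_path,
           "--workspace=2048", "--avgRuns=100"] + precision_flags
    subprocess.run(cmd, check=True)

# FP16 baseline: plain (non-QAT) ONNX, built with --fp16
run_trtexec("efficientnet_b0_fp32.onnx", ["--fp16"])

# QAT model: the ONNX already contains Q/DQ nodes, but --int8 is still needed
# so the builder may pick INT8 kernels; --fp16 allows FP16 fallback for layers
# that cannot run in INT8.
run_trtexec("efficientnet_b0_qat.onnx", ["--int8", "--fp16"])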

The accuracy of both versions is fine; the problem is that I expected the INT8 version to be significantly faster than the FP16 one, not slower. I suspect TensorRT didn't fuse some layers, or is doing extra computations because of the Quantize/Dequantize layers, but to be honest I'm not sure.
Below is a screenshot of the INT8 ONNX model that runs at 18 ms.
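
To look into the fusion question, something along these lines can be used to list where the Quantize/Dequantize pairs ended up in the exported graph. It is only a rough sketch: it assumes the onnx Python package, and the file name is a placeholder.

# List QuantizeLinear / DequantizeLinear nodes in the exported graph to see
# whether every weighted layer is surrounded by a Q/DQ pair. A Q/DQ pair
# placed where TensorRT cannot fuse it (e.g. around an elementwise add or
# the squeeze-excite block) can leave extra reformat/quantize kernels behind.
import onnx
from collections import Counter

model = onnx.load("efficientnet_b0_qat.onnx")   # placeholder file name
op_counts = Counter(node.op_type for node in model.graph.node)
print(op_counts["QuantizeLinear"], "QuantizeLinear nodes")
print(op_counts["DequantizeLinear"], "DequantizeLinear nodes")

for node in model.graph.node:
    if node.op_type in ("QuantizeLinear", "DequantizeLinear"):
        print(node.op_type, node.name, "->", list(node.output))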

Basically, the steps I followed are:

  • Train an original (non-QAT) version of EfficientNet
  • Load the trained weights into my modified (QAT-compatible) EfficientNet model
  • Calibrate the model exactly as in the VGG QAT notebook above
  • Add the fake_quantize_per_channel_affine function to symbolic_opset10.py as here
  • Export the ONNX model (a rough sketch of the calibration and export steps is shown right after this list)
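
This is roughly what the calibration and export part looks like; it is a simplified sketch rather than my exact script. The model constructor and the calibration data loader are placeholders, and the calibration helpers follow the pytorch_quantization examples linked above.

# Simplified sketch of steps 3-5 (calibration + ONNX export). The model
# constructor, the calibration data loader and the file name are placeholders.
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import calib
from pytorch_quantization.tensor_quant import QuantDescriptor

def collect_stats(model, data_loader, num_batches=64):
    # Put every TensorQuantizer into calibration mode, feed data, switch back.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()
    with torch.no_grad():
        for i, (images, _) in enumerate(data_loader):
            model(images.cuda())
            if i >= num_batches:
                break
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

def compute_amax(model, **kwargs):
    # Load the calibrated amax values into the quantizers.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax(**kwargs)

# Histogram calibration for activations, as in the VGG QAT notebook
quant_nn.QuantConv2d.set_default_quant_desc_input(QuantDescriptor(calib_method="histogram"))

model = build_qat_efficientnet().cuda().eval()   # placeholder constructor
collect_stats(model, calib_loader)               # placeholder data loader
compute_amax(model, method="percentile", percentile=99.99)

# Export fake-quant nodes as QuantizeLinear/DequantizeLinear pairs;
# opset 10 matches the symbolic_opset10.py patch mentioned above.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(32, 3, 100, 100, device="cuda")
torch.onnx.export(model, dummy, "efficientnet_b0_qat.onnx",
                  opset_version=10, enable_onnx_checker=False)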

Environment

TensorRT Version: 8.0.1.6
GPU Type: 512-core Volta GPU with Tensor Cores
NVIDIA Driver Version: JetPack 4.6
CUDA Version: 10.2
CUDNN Version: 8.2
Operating System + Version: Ubuntu 18.04.5 LTS
Python Version: 3.8
PyTorch Version: 1.9

Steps to reproduce

From this GitHub repo, replace the Conv2dStaticSamePadding class with this:

import math
import torch.nn as nn


class Conv2dStaticSamePadding(nn.Module):
    """Conv2d with TensorFlow-style 'SAME' padding computed statically from a
    fixed image size. Composes an nn.Conv2d member instead of subclassing it,
    which is the change that makes the layer work with pytorch_quantization."""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1, image_size=None, **kwargs):
        super().__init__()
        self._homemadeConv2d = nn.Conv2d(in_channels, out_channels,
                                         kernel_size=kernel_size, stride=stride, **kwargs)

        # nn.Conv2d already normalizes stride to a 2-tuple; keep the guard anyway
        self._homemadeConv2d.stride = (self._homemadeConv2d.stride
                                       if len(self._homemadeConv2d.stride) == 2
                                       else [self._homemadeConv2d.stride[0]] * 2)

        self.o_c = out_channels

        # Compute the static 'SAME' padding from the fixed input size
        assert image_size is not None
        ih, iw = (image_size, image_size) if isinstance(image_size, int) else image_size
        kh, kw = self._homemadeConv2d.weight.size()[-2:]
        sh, sw = self._homemadeConv2d.stride
        oh, ow = math.ceil(ih / sh), math.ceil(iw / sw)
        pad_h = max((oh - 1) * sh + (kh - 1) * self._homemadeConv2d.dilation[0] + 1 - ih, 0)
        pad_w = max((ow - 1) * sw + (kw - 1) * self._homemadeConv2d.dilation[1] + 1 - iw, 0)

        if pad_h > 0 or pad_w > 0:
            self.static_padding = nn.ZeroPad2d((pad_w // 2, pad_w - pad_w // 2,
                                                pad_h // 2, pad_h - pad_h // 2))
        else:
            self.static_padding = nn.Identity()

    def forward(self, x):
        x = self.static_padding(x)
        x = self._homemadeConv2d(x)
        return x

Then, simply follow this tutorial.
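
One detail worth noting for anyone reproducing this: as far as I understand the pytorch_quantization flow, quant_modules.initialize() must be called before the model is built, so that the nn.Conv2d created inside Conv2dStaticSamePadding above is replaced by a QuantConv2d. A minimal sketch (the model constructor is a placeholder):

# quant_modules.initialize() monkey-patches torch.nn.Conv2d, so the nn.Conv2d
# instantiated inside Conv2dStaticSamePadding becomes a QuantConv2d carrying
# input/weight TensorQuantizers.
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

quant_modules.initialize()          # must run before building the model
model = build_qat_efficientnet()    # placeholder constructor

# Sanity check: count how many convolutions are now quantized
n_quant = sum(isinstance(m, quant_nn.QuantConv2d) for m in model.modules())
print(n_quant, "quantized conv layers")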

Hi, please refer to the links below to perform inference in INT8.

Thanks!

Thank you for your reply, but my problem is not how to perform INT8 inference with TensorRT; it is about the INT8 engine that gets generated. I can build and use the engine without any issue; my point is that the INT8 inference time is slower than FP16. I've seen that this is a reported problem, so please let me know if you have any advice.
Thanks

Hi,

We are still not sure that INT8 performs better than FP16 in the post you've mentioned.
The following similar issue may also give you more insight:
https://github.com/NVIDIA/TensorRT/issues/993

Thank you.