QAT int8 TRT engine slower than fp16

Description

I have followed several tutorials to perform QAT on an EfficientNet model with PyTorch. The implementation I started from does not natively support QAT, but by slightly modifying the Conv2dStaticSamePadding class I was able to make it work with the pytorch_quantization library.
Following this example and this documentation, I finally managed to produce an INT8 quantized model that performs as well as its FP16 version.

The point of my post is that I can't understand why this INT8 model is slower than the FP16 version. I ran a trtexec benchmark of both engines on my AGX; these are the results:

FP16, batch size 32, EfficientNet-B0, 32x3x100x100: 9.8 ms
INT8, batch size 32, EfficientNet-B0, 32x3x100x100: 18 ms
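
For completeness, here is a rough sketch of the kind of trtexec invocations used for this benchmark; the ONNX file names and the workspace/avgRuns values are placeholders, not my exact command lines.

# Rough sketch of the two trtexec runs (trtexec lives under
# /usr/src/tensorrt/bin on Jetson; file names and the workspace/avgRuns
# values are placeholders).
import subprocess

def run_trtexec(onnx_path, precision_flags):
    cmd = ["trtexec", "--onnx=" + onnx_path,
           "--workspace=2048", "--avgRuns=100"] + precision_flags
    subprocess.run(cmd, check=True)

# FP16 baseline: plain (non-QAT) ONNX, built with --fp16
run_trtexec("efficientnet_b0_fp32.onnx", ["--fp16"])

# QAT model: the ONNX already contains Q/DQ nodes, but --int8 is still needed
# so the builder may pick INT8 kernels; --fp16 allows FP16 fallback for layers
# that cannot run in INT8.
run_trtexec("efficientnet_b0_qat.onnx", ["--int8", "--fp16"])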

The accuracy of both versions is fine; the problem is that I expected the INT8 version to be significantly faster than the FP16 one, not slower. I suspect TensorRT didn't fuse some layers, or is doing extra computations because of the Quantize/Dequantize layers, but to be honest I'm not sure.
Below is a screenshot of the INT8 ONNX model that runs at 18 ms.
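
To look into the fusion question, something along these lines can be used to list where the Quantize/Dequantize pairs ended up in the exported graph. It is only a rough sketch: it assumes the onnx Python package, and the file name is a placeholder.

# List QuantizeLinear / DequantizeLinear nodes in the exported graph to see
# whether every weighted layer is surrounded by a Q/DQ pair. A Q/DQ pair
# placed where TensorRT cannot fuse it (e.g. around an elementwise add or
# the squeeze-excite block) can leave extra reformat/quantize kernels behind.
import onnx
from collections import Counter

model = onnx.load("efficientnet_b0_qat.onnx")   # placeholder file name
op_counts = Counter(node.op_type for node in model.graph.node)
print(op_counts["QuantizeLinear"], "QuantizeLinear nodes")
print(op_counts["DequantizeLinear"], "DequantizeLinear nodes")

for node in model.graph.node:
    if node.op_type in ("QuantizeLinear", "DequantizeLinear"):
        print(node.op_type, node.name, "->", list(node.output))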

Basically, the steps I followed are:

  • Train an original (non-QAT) version of EfficientNet
  • Load the trained weights into my modified (QAT-compatible) EfficientNet model
  • Calibrate the model exactly as in the VGG QAT notebook above
  • Add the fake_quantize_per_channel_affine function to symbolic_opset10.py as here
  • Export the ONNX model (a rough sketch of the calibration and export steps is shown right after this list)
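
This is roughly what the calibration and export part looks like; it is a simplified sketch rather than my exact script. The model constructor and the calibration data loader are placeholders, and the calibration helpers follow the pytorch_quantization examples linked above.

# Simplified sketch of steps 3-5 (calibration + ONNX export). The model
# constructor, the calibration data loader and the file name are placeholders.
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import calib
from pytorch_quantization.tensor_quant import QuantDescriptor

def collect_stats(model, data_loader, num_batches=64):
    # Put every TensorQuantizer into calibration mode, feed data, switch back.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()
    with torch.no_grad():
        for i, (images, _) in enumerate(data_loader):
            model(images.cuda())
            if i >= num_batches:
                break
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

def compute_amax(model, **kwargs):
    # Load the calibrated amax values into the quantizers.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax(**kwargs)

# Histogram calibration for activations, as in the VGG QAT notebook
quant_nn.QuantConv2d.set_default_quant_desc_input(QuantDescriptor(calib_method="histogram"))

model = build_qat_efficientnet().cuda().eval()   # placeholder constructor
collect_stats(model, calib_loader)               # placeholder data loader
compute_amax(model, method="percentile", percentile=99.99)

# Export fake-quant nodes as QuantizeLinear/DequantizeLinear pairs;
# opset 10 matches the symbolic_opset10.py patch mentioned above.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(32, 3, 100, 100, device="cuda")
torch.onnx.export(model, dummy, "efficientnet_b0_qat.onnx",
                  opset_version=10, enable_onnx_checker=False)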

Environment

TensorRT Version: 8.0.1.6
GPU Type: 512-core Volta GPU with Tensor Cores
NVIDIA Driver Version: JetPack 4.6
CUDA Version: 10.2
CUDNN Version: 8.2
Operating System + Version: Ubuntu 18.04.5 LTS
Python Version: 3.8
PyTorch Version: 1.9

Steps to reproduce

From this GitHub repo, replace the Conv2dStaticSamePadding class with this:

import math
import torch.nn as nn


class Conv2dStaticSamePadding(nn.Module):
    """Conv2d with TensorFlow-style 'SAME' padding computed statically from a
    fixed image size. Composes an nn.Conv2d member instead of subclassing it,
    which is the change that makes the layer work with pytorch_quantization."""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1, image_size=None, **kwargs):
        super().__init__()
        self._homemadeConv2d = nn.Conv2d(in_channels, out_channels,
                                         kernel_size=kernel_size, stride=stride, **kwargs)

        # nn.Conv2d already normalizes stride to a 2-tuple; keep the guard anyway
        self._homemadeConv2d.stride = (self._homemadeConv2d.stride
                                       if len(self._homemadeConv2d.stride) == 2
                                       else [self._homemadeConv2d.stride[0]] * 2)

        self.o_c = out_channels

        # Compute the static 'SAME' padding from the fixed input size
        assert image_size is not None
        ih, iw = (image_size, image_size) if isinstance(image_size, int) else image_size
        kh, kw = self._homemadeConv2d.weight.size()[-2:]
        sh, sw = self._homemadeConv2d.stride
        oh, ow = math.ceil(ih / sh), math.ceil(iw / sw)
        pad_h = max((oh - 1) * sh + (kh - 1) * self._homemadeConv2d.dilation[0] + 1 - ih, 0)
        pad_w = max((ow - 1) * sw + (kw - 1) * self._homemadeConv2d.dilation[1] + 1 - iw, 0)

        if pad_h > 0 or pad_w > 0:
            self.static_padding = nn.ZeroPad2d((pad_w // 2, pad_w - pad_w // 2,
                                                pad_h // 2, pad_h - pad_h // 2))
        else:
            self.static_padding = nn.Identity()

    def forward(self, x):
        x = self.static_padding(x)
        x = self._homemadeConv2d(x)
        return x

Then, simply follow this tutorial.
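
One detail worth noting for anyone reproducing this: as far as I understand the pytorch_quantization flow, quant_modules.initialize() must be called before the model is built, so that the nn.Conv2d created inside Conv2dStaticSamePadding above is replaced by a QuantConv2d. A minimal sketch (the model constructor is a placeholder):

# quant_modules.initialize() monkey-patches torch.nn.Conv2d, so the nn.Conv2d
# instantiated inside Conv2dStaticSamePadding becomes a QuantConv2d carrying
# input/weight TensorQuantizers.
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

quant_modules.initialize()          # must run before building the model
model = build_qat_efficientnet()    # placeholder constructor

# Sanity check: count how many convolutions are now quantized
n_quant = sum(isinstance(m, quant_nn.QuantConv2d) for m in model.modules())
print(n_quant, "quantized conv layers")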

Hi, please refer to the links below to perform inference in INT8.

Thanks!

Thank you for your reply, but my problem is not how to perform INT8 inference with TensorRT; it is about the INT8 engine that gets generated. I can build and use the engine without any issue; my point is that the INT8 inference time is slower than FP16. I've seen that this is a reported problem, so please let me know if you have any advice.
Thanks

Hi,

We are still not sure that INT8 performs better than FP16 in the post you've mentioned.
The following similar issue may also give you more insight:
https://github.com/NVIDIA/TensorRT/issues/993

Thank you.