I have followed several tutorials to perform QAT on an EfficientNet model with PyTorch. The implementation I started from doesn't natively support QAT, but by slightly changing Conv2dStaticSamePadding I could make it work with the pytorch_quantization library.
Following this example and this documentation, I finally came up with an INT8 quantized model that performs as well as its FP16 version.
The point of my post is that I can't understand why this INT8 model is slower than the FP16 version. I ran a trtexec benchmark of both of them on my AGX; these are the results:
FP16, batch size 32, EfficientNet-B0, 32x3x100x100: 9.8 ms
INT8, batch size 32, EfficientNet-B0, 32x3x100x100: 18 ms
The accuracy results are correct and both versions do well; the problem is that I expected the INT8 version to be much faster than the FP16 one. I suspect TensorRT didn't fuse some layers, or is doing extra computation because of the Quantize/Dequantize layers. To be honest, I'm not sure.
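For context on what unfused Quantize/Dequantize layers cost: each Q/DQ pair in the ONNX graph is an int8 round trip. Here is a minimal sketch in plain Python (symmetric per-tensor quantization for illustration only; pytorch_quantization actually uses per-channel scales for weights):

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Quantize one value to int8 and dequantize it back, which is
    what a QuantizeLinear/DequantizeLinear pair computes."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale

print(fake_quantize(0.33, 0.01))   # in range: survives with only rounding error
print(fake_quantize(100.0, 0.01))  # out of range: clipped to qmax * scale = 1.27
```

When TensorRT can fuse a Q/DQ pair into the neighbouring convolution, this round trip disappears into int8 tensor-core math; any pair left unfused runs as extra elementwise kernels, which is one plausible reason an INT8 engine can end up slower than FP16.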
This is a screenshot of the INT8 ONNX model that runs at 18 ms. My procedure was the following:
- Train an original version of EfficientNet
- Export the weights on my modified (QAT-compatible) EfficientNet model
- Calibrate the model exactly as in the VGG QAT notebook above
- Add the fake_quantize_per_channel_affine function to symbolic_opset10.py as here.
- Export the model to ONNX
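The calibrate-then-export steps above can be sketched as follows, based on the pytorch_quantization workflow from the VGG QAT notebook; the function name and loop details are illustrative, not the exact notebook code (imports are deferred so the sketch is readable without the libraries installed):

```python
def export_qat_onnx(model, calib_loader, onnx_path, num_batches=4, opset=13):
    """Illustrative sketch: calibrate the TensorQuantizer modules of a
    QAT-compatible model, then export ONNX with fake-quant (Q/DQ) nodes."""
    import torch
    from pytorch_quantization import nn as quant_nn

    # 1. Switch quantizers to calibration mode: collect statistics, don't quantize.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    # 2. Feed a few batches of representative data through the model.
    with torch.no_grad():
        for i, (images, _) in enumerate(calib_loader):
            model(images.cuda())
            if i >= num_batches:
                break

    # 3. Load the calibrated amax values and re-enable quantization.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                # Histogram calibrators may need e.g. method="percentile".
                module.load_calib_amax()
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

    # 4. Export with fake-quant nodes so TensorRT sees Q/DQ layers.
    quant_nn.TensorQuantizer.use_fb_fake_quant = True
    dummy = torch.randn(32, 3, 100, 100, device="cuda")
    torch.onnx.export(model, dummy, onnx_path, opset_version=opset)
```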
TensorRT Version: 8.0.1 (JetPack 4.6)
GPU Type: 512-core Volta GPU with Tensor Cores
Nvidia Driver Version: jetpack 4.6
CUDA Version: 10.2
CUDNN Version: 8.2
Operating System + Version: Ubuntu 18.04 LTS
Python Version: 3.8
PyTorch Version: 1.9
From this GitHub repo, replace the Conv2dStaticSamePadding class with this:
```python
import math

import torch.nn as nn


class Conv2dStaticSamePadding(nn.Module):
    """TF-style "same" padding computed once at init time, with the convolution
    itself a plain nn.Conv2d so pytorch_quantization can instrument it."""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1,
                 image_size=None, **kwargs):
        super().__init__()
        self._homemadeConv2d = nn.Conv2d(in_channels, out_channels,
                                         kernel_size=kernel_size,
                                         stride=stride, **kwargs)
        self._homemadeConv2d.stride = (self._homemadeConv2d.stride
                                       if len(self._homemadeConv2d.stride) == 2
                                       else [self._homemadeConv2d.stride[0]] * 2)
        self.o_c = out_channels

        # Compute the static padding from the known, fixed input image size.
        assert image_size is not None
        ih, iw = (image_size, image_size) if isinstance(image_size, int) else image_size
        kh, kw = self._homemadeConv2d.weight.size()[-2:]
        sh, sw = self._homemadeConv2d.stride
        dh, dw = self._homemadeConv2d.dilation
        oh, ow = math.ceil(ih / sh), math.ceil(iw / sw)
        pad_h = max((oh - 1) * sh + (kh - 1) * dh + 1 - ih, 0)
        pad_w = max((ow - 1) * sw + (kw - 1) * dw + 1 - iw, 0)
        if pad_h > 0 or pad_w > 0:
            # nn.ZeroPad2d takes (left, right, top, bottom).
            self.static_padding = nn.ZeroPad2d((pad_w // 2, pad_w - pad_w // 2,
                                                pad_h // 2, pad_h - pad_h // 2))
        else:
            self.static_padding = nn.Identity()

    def forward(self, x):
        x = self.static_padding(x)
        x = self._homemadeConv2d(x)
        return x
```
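To sanity-check the padding arithmetic above, here is a standalone pure-Python sketch (dilation 1) applied to the 100x100 benchmark input with a 3x3, stride-2 convolution like EfficientNet-B0's stem:

```python
import math

def same_padding(ih, iw, kh, kw, sh, sw, dh=1, dw=1):
    """Return TF-style 'same' padding as (left, right, top, bottom),
    mirroring the arithmetic in Conv2dStaticSamePadding."""
    oh, ow = math.ceil(ih / sh), math.ceil(iw / sw)
    pad_h = max((oh - 1) * sh + (kh - 1) * dh + 1 - ih, 0)
    pad_w = max((ow - 1) * sw + (kw - 1) * dw + 1 - iw, 0)
    return (pad_w // 2, pad_w - pad_w // 2, pad_h // 2, pad_h - pad_h // 2)

# EfficientNet-B0 stem conv (3x3, stride 2) on the 100x100 benchmark input:
print(same_padding(100, 100, 3, 3, 2, 2))  # → (0, 1, 0, 1)
```

One extra pixel of padding on the right and bottom gives the ceil(100/2) = 50x50 output that "same" padding requires.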
Then, simply follow this tutorial.