Description
I have followed several tutorials to perform QAT (quantization-aware training) on an EfficientNet model with PyTorch. First, this implementation doesn't natively support QAT, but by slightly changing the Conv2dStaticSamePadding class I could make it work with the pytorch_quantization library.
Following this example and this documentation, I finally managed to come up with an INT8 quantized model that performs as well as its FP16 version.
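For context, the way pytorch_quantization gets hooked in looks roughly like this (a minimal sketch of the standard flow, not my exact code; as far as I understand, quant_modules.initialize() only swaps genuine nn.Conv2d instances, which is why the class change below wraps a plain nn.Conv2d instead of subclassing it):

import torch.nn as nn
from pytorch_quantization import quant_modules

# Monkey-patch the torch.nn layer classes (Conv2d, Linear, ...) with
# quantized counterparts that carry TensorQuantizer fake-quant nodes
# on inputs and weights. Only layers instantiated AFTER this call are
# affected.
quant_modules.initialize()

# Any nn.Conv2d created from here on is actually a QuantConv2d:
conv = nn.Conv2d(3, 32, kernel_size=3)
print(type(conv).__name__)  # QuantConv2d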
The point of my post is that I can't understand why this INT8 model is slower than the FP16 version. I ran a trtexec benchmark of both of them on my AGX; these are the results:
FP16, BatchSize 32, EfficientNetB0, 32x3x100x100 : 9.8ms
INT8, BatchSize 32, EfficientNetB0, 32x3x100x100 : 18ms
The results are correct and both versions perform well; the problem is that I expected the INT8 version to be much faster than the FP16 one. I suspect TensorRT didn't fuse some layers, or is doing extra computation because of the Quantize layers. To be honest, I'm not sure.
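For reference, the numbers above come from trtexec runs along these lines (a sketch, not my exact command: the file name is a placeholder, the flags are standard trtexec options in TensorRT 8.0, and --fp16 replaces --int8 for the FP16 run):

trtexec --onnx=effnet_qat_int8.onnx --int8 --dumpProfile --verbose

The --verbose build log lists the fusions TensorRT performed and the precision chosen for each layer, and --dumpProfile prints per-layer timings, so standalone Quantize/Dequantize layers that didn't get fused into the convolutions should show up there.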
This is a screenshot of the INT8 ONNX model that runs at 18 ms.
Basically, the steps I followed are:
- Train an original version of EfficientNet
- Load the trained weights into my modified (QAT-compatible) EfficientNet model
- Calibrate the model exactly as in the VGG QAT notebook above (a condensed sketch follows this list)
- Add the fake_quantize_per_channel_affine symbolic function to symbolic_opset10.py, as shown here.
- Export the ONNX model
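The calibration and export steps follow the pattern from the pytorch_quantization documentation; roughly this (a condensed sketch, not my exact script; model, data_loader, and the file name are placeholders):

import torch
from pytorch_quantization import calib
from pytorch_quantization import nn as quant_nn

def calibrate(model, data_loader, num_batches=32):
    """Feed a few batches through the model to collect activation
    statistics, then load the resulting amax values (step 3)."""
    # Put every TensorQuantizer into calibration mode.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    with torch.no_grad():
        for i, (image, _) in enumerate(data_loader):
            model(image.cuda())
            if i + 1 >= num_batches:
                break

    # Compute amax from the collected statistics and switch back
    # to (fake-)quantized mode.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax("percentile", percentile=99.99)
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

def export_onnx(model, path="effnet_qat.onnx"):
    """Export the calibrated/fine-tuned model to ONNX (step 5)."""
    # Make TensorQuantizer emit fake-quantize ops the ONNX exporter
    # can translate into QuantizeLinear/DequantizeLinear pairs.
    quant_nn.TensorQuantizer.use_fb_fake_quant = True
    dummy = torch.randn(32, 3, 100, 100, device="cuda")
    # Per-channel Q/DQ is native to ONNX opset 13; on PyTorch 1.9 the
    # symbolic_opset10.py patch from step 4 is what makes the
    # fake_quantize_per_channel_affine op exportable.
    torch.onnx.export(model, dummy, path, opset_version=13,
                      do_constant_folding=True)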
Environment
TensorRT Version: 8.0.1.6
GPU Type: 512-core Volta GPU with Tensor Cores
Nvidia Driver Version: JetPack 4.6
CUDA Version: 10.2
CUDNN Version: 8.2
Operating System + Version: Ubuntu 18.04.5 LTS
Python Version: 3.8
PyTorch Version: 1.9
Steps to reproduce
From this GitHub repo, replace the Conv2dStaticSamePadding class with this:
import math

import torch.nn as nn


class Conv2dStaticSamePadding(nn.Module):
    """Conv2d with TensorFlow-style 'same' padding, computed once from a
    fixed image size. Wraps a plain nn.Conv2d so that pytorch_quantization's
    quant_modules.initialize() can swap it for a QuantConv2d."""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1,
                 image_size=None, **kwargs):
        super().__init__()
        self._homemadeConv2d = nn.Conv2d(in_channels, out_channels,
                                         kernel_size=kernel_size,
                                         stride=stride, **kwargs)
        # Normalize stride to two elements (height, width).
        self._homemadeConv2d.stride = (
            self._homemadeConv2d.stride
            if len(self._homemadeConv2d.stride) == 2
            else [self._homemadeConv2d.stride[0]] * 2)
        self.o_c = out_channels

        # Compute the static 'same' padding from the fixed input size.
        assert image_size is not None
        ih, iw = (image_size, image_size) if isinstance(image_size, int) else image_size
        kh, kw = self._homemadeConv2d.weight.size()[-2:]
        sh, sw = self._homemadeConv2d.stride
        oh, ow = math.ceil(ih / sh), math.ceil(iw / sw)
        pad_h = max((oh - 1) * sh + (kh - 1) * self._homemadeConv2d.dilation[0] + 1 - ih, 0)
        pad_w = max((ow - 1) * sw + (kw - 1) * self._homemadeConv2d.dilation[1] + 1 - iw, 0)
        if pad_h > 0 or pad_w > 0:
            self.static_padding = nn.ZeroPad2d((pad_w // 2, pad_w - pad_w // 2,
                                                pad_h // 2, pad_h - pad_h // 2))
        else:
            self.static_padding = nn.Identity()

    def forward(self, x):
        x = self.static_padding(x)
        x = self._homemadeConv2d(x)
        return x
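As a quick sanity check of the patched layer (a minimal sketch; assumes the class above and its imports are in scope; plain FP32 here, since with quant_modules.initialize() called before construction the inner nn.Conv2d would become a QuantConv2d instead):

import torch

# Stride-2 conv on a 100x100 input: TF-style 'same' padding should
# give ceil(100/2) = 50 in each spatial dimension.
conv = Conv2dStaticSamePadding(3, 32, kernel_size=3, stride=2, image_size=100)
x = torch.randn(8, 3, 100, 100)
print(conv(x).shape)  # torch.Size([8, 32, 50, 50])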
Then, simply follow this tutorial.