The ONNX-TensorRT operator support list (https://github.com/onnx/onnx-tensorrt/blob/main/docs/operators.md) shows that HardSwish exported from ONNX supports INT8 inference. However, when I tried the simplest network with INT8 in TensorRT, I noticed that HardSwish was not quantized to INT8.
Could you advise why HardSwish was not quantized to INT8 in my test case? Thank you!
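For reference, the test was along the lines of the sketch below: a single Conv + HardSwish block exported to ONNX and then built with INT8 enabled. The layer sizes, names, and export settings here are illustrative, not necessarily the exact script I used.

```python
# Minimal sketch (illustrative): a Conv + HardSwish block exported to ONNX,
# then built as an INT8 engine.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.act = nn.Hardswish()  # should export as an ONNX HardSwish node with opset >= 14

    def forward(self, x):
        return self.act(self.conv(x))

torch.onnx.export(TinyNet().eval(),
                  torch.randn(1, 3, 224, 224),
                  "hardswish_test.onnx",
                  opset_version=14)
# Engine built afterwards with, e.g.:
#   trtexec --onnx=hardswish_test.onnx --int8 --verbose
```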
Environment
TensorRT Version: 8.6.1
GPU Type: 3060
Nvidia Driver Version: 523
CUDA Version: 12.2
CUDNN Version:
Operating System + Version:
Python Version (if applicable): 3.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 2.0
Baremetal or Container (if container which image + tag):
Relevant Files
Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)
Hi,
Can you try running your model with the trtexec command (for example, trtexec --onnx=model.onnx --int8 --verbose) and share the --verbose log in case the issue persists?
You can refer to the link below for the full list of supported operators; if an operator is not supported, you need to create a custom plugin for that operation.
Also, please share your model and script, if not already shared, so that we can help you better.
Meanwhile, for some common errors and queries, please refer to the link below:
I’m very sorry for the late reply. I have uploaded the code and models. I checked the supported operators list, and it does say HardSwish supports INT8 computation, but as you can see in ori_layer.json.svg, HardSwish ran in floating point. Aside from the question of how to get HardSwish to compute in INT8, I also did Q/DQ quantization while keeping the scales consistent, yet there was still a lot of floating-point data reformatting. Why is that? Thank you for your reply. demoqdq.zip (455.0 KB)
In the archive, demo.ipynb contains the network code and related operations, ori.onnx is the exported model, and ori_verbose.json is the verbose build log. A rough sketch of the Q/DQ setup is shown below.
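The Q/DQ insertion follows the usual pytorch-quantization pattern, roughly as in this sketch. The module names and the fixed amax values are illustrative only; in demo.ipynb the scales come from calibration.

```python
# Rough sketch of the Q/DQ setup with pytorch-quantization. The fixed amax values
# keep adjacent Q/DQ pairs on the same scale; they stand in for calibrated values.
import torch
import torch.nn as nn
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

act_desc = QuantDescriptor(num_bits=8, amax=4.0)  # one shared activation scale
wgt_desc = QuantDescriptor(num_bits=8, amax=1.0)  # per-tensor weight scale (sketch only)

class QConvHardSwish(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = quant_nn.QuantConv2d(c_in, c_out, 3, padding=1,
                                         quant_desc_input=act_desc,
                                         quant_desc_weight=wgt_desc)
        self.act = nn.Hardswish()

    def forward(self, x):
        return self.act(self.conv(x))

model = nn.Sequential(QConvHardSwish(3, 16), QConvHardSwish(16, 16)).eval()

# Export QuantizeLinear/DequantizeLinear nodes instead of the fake-quant custom op
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model, torch.randn(1, 3, 224, 224), "qdq_test.onnx",
                  opset_version=13)
```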
Q1: I tried to add a residual_quantizer, but that node could not be found in the final ONNX, and residual_quantizer does not appear in the log either. Yet it does reduce the data flow, which is great but hard to understand.
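For clarity, by residual_quantizer I mean roughly the following: a TensorQuantizer placed on the skip branch so that both inputs of the Add carry Q/DQ. This is only a sketch with illustrative names and fixed amax values; the actual code is in demo.ipynb.

```python
# Sketch of a residual quantizer (illustrative): a TensorQuantizer on the
# identity branch so the Add sees a quantized skip input as well.
import torch.nn as nn
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv = quant_nn.QuantConv2d(
            c, c, 3, padding=1,
            quant_desc_input=QuantDescriptor(num_bits=8, amax=4.0),
            quant_desc_weight=QuantDescriptor(num_bits=8, amax=1.0))
        self.act = nn.Hardswish()
        # Q/DQ on the residual path; the fixed amax stands in for a calibrated value
        self.residual_quantizer = quant_nn.TensorQuantizer(
            QuantDescriptor(num_bits=8, amax=4.0))

    def forward(self, x):
        return self.act(self.conv(x)) + self.residual_quantizer(x)

block = ResBlock(16)
```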
Q2: Although the network runs pure INT8 inference, it is not as fast as the engine built without Q/DQ nodes. This can be clearly seen by comparing ori_layer.json.svg and qdq_layer.json.svg. The same convolution node in the QDQ engine:
[Convolution
0.0729557 ms
model.conv2.conv.weight
/model/conv2/conv/_weight_quantizer/QuantizeLinear
/model/conv2/conv/Conv]
In the engine without Q/DQ nodes:
[Convolution
0.0502896 ms
/model/conv2/conv/Conv]
The other convolutions show the same pattern.
Thank you for your reply.
Below are the new code, model, and logs: demoqdq.zip (461.7 KB)