But I found out the performance of int8 is much slower than fp16.
with trtexec, fp16 reaches 346.861 qps, and int8 reaches 217.914 qps.
Here is the model with quanziation/dequantization node epoch_15.onnx (1.7 MB)
, and here are the model without quanziation/dequantization node epoch_250.onnx (1.6 MB)
Hi , UFF and Caffe Parser have been deprecated from TensorRT 7 onwards, hence request you to try ONNX parser.
Please check the below link for the same.
After our team working on this identified that QAT int inference is slower than fp16 inference is because the model is running in mixed precision. In order to run the whole network with int8 precision, additional Q and DQ layers are required to be inserted between BN and leakyRelu layer.
Just wanted to share some of our observations. Hope that helps fellow developers and saves some headaches.
First of all it is recommended to read and re-read Explicit-Quantization part of TensorRT docs, especially Q/DQ Layer-Placement Recommendations section. Most of the behaviour we were trying to get our heads around is actually explained there.
For example, the fact that we might need to place Q/DQ nodes before element-wise addition to fuse and quantise it in residual layers (like in Resnets or CSPDarknets). So build your engine with --verbose flag, check logs - look for Engine Layer Information part which we found the most useful, and try experimenting with additional Q/DQ nodes.
Also in addition to @spolisetty’s observations with LeakyReLU, we also had to place additional Q/DQ nodes between BN and SiLU layers to force them to run in int8. As we understand and as mentioned in the docs they are not fused and quantised by default because “It’s sometimes useful to preserve the higher-precision dequantized output. For example, we can want to follow the linear operation by an activation function (SiLU, in the following diagram) that requires higher precision to improve network accuracy.” But in our case we did not have any drops in performance with QAT training.
Have a look at Quantizing Resnet50 — pytorch-quantization master documentation. There is an example how a residual quantizer is inserted in BasicBlock and Bottleneck layers. You need to 1) initialize additional quantizer in those layers with BN and LeakyReLU and 2) then change forward call so the input goes from BN to that quantizer firtst and then to LeakyReLU.