I tried to benchmark INT8 and FP16 inference for MobileNet0.25+SSD on a Jetson NX with JetPack 4.6.
For post-training quantization I used the pytorch-quantization toolkit (TensorRT/tools/pytorch-quantization at master · NVIDIA/TensorRT · GitHub) and generated a calibrated ONNX model.
But I found that INT8 is much slower than FP16: with trtexec, FP16 reaches 346.861 qps while INT8 only reaches 217.914 qps.
Here is the model with quantization/dequantization nodes: epoch_15.onnx (1.7 MB), and here is the model without them: epoch_250.onnx (1.6 MB).
Here is the trtexec log for INT8: int8.txt (28.7 KB), and for FP16: fp16.txt (30.0 KB).
Hi, the UFF and Caffe parsers have been deprecated since TensorRT 7, so we request that you try the ONNX parser instead.
Please check the link below for details.
I did not use the UFF or Caffe parser; I am already using the ONNX parser. Please look into my question, thank you. This issue is very close to mine (inference of QAT int8 model did not accelerate · Issue #1423 · NVIDIA/TensorRT · GitHub).
I also found that it is even slower than PTQ: ptq_int8.txt (32.5 KB)
It looks like you are using a Jetson platform. INT8 may not be supported on your Jetson hardware; please check the precision support matrix here.
Please allow us some time to test this on a V100. Meanwhile, we recommend trying mixed precision and FP16.
INT8 is supported on the Jetson NX; we got very good speed with PTQ.
We could reproduce similar behaviour. Please allow us some time to work on this.
In fact, we cannot guarantee that INT8 performance is always better than FP16 performance. However, we will still take a look.
After working on this, our team identified that QAT INT8 inference is slower than FP16 inference because the model is running in mixed precision. To run the whole network in INT8, additional Q and DQ layers need to be inserted between the BN and LeakyReLU layers.
Please find modified.onnx.
modified.onnx (1.7 MB)
Just wanted to share some of our observations. Hope that helps fellow developers and saves some headaches.
First of all, we recommend reading and re-reading the Explicit Quantization part of the TensorRT docs, especially the Q/DQ Layer-Placement Recommendations section. Most of the behaviour we were trying to get our heads around is actually explained there.
For example, you might need to place Q/DQ nodes before an element-wise addition to fuse and quantise it in residual layers (as in ResNets or CSPDarknets). So build your engine with the --verbose flag, check the logs (we found the Engine Layer Information section the most useful), and try experimenting with additional Q/DQ nodes.
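The residual-add placement can be sketched in PyTorch like this. Note this is a toy illustration, not the toolkit's API: FakeQuant is a minimal stand-in for pytorch-quantization's TensorQuantizer, and the fixed amax replaces real calibration.

```python
# Sketch: quantizing the skip connection of a residual block so the
# element-wise add sees two int8 inputs and can be fused by TensorRT.
import torch
import torch.nn as nn

class FakeQuant(nn.Module):
    """Toy symmetric int8 fake-quantizer with a fixed range (stand-in for
    pytorch-quantization's TensorQuantizer; no calibration)."""
    def __init__(self, amax=4.0):
        super().__init__()
        self.scale = amax / 127.0
    def forward(self, x):
        return torch.clamp(torch.round(x / self.scale), -127, 127) * self.scale

class ResidualBlock(nn.Module):
    def __init__(self, ch=8):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn = nn.BatchNorm2d(ch)
        self.act = nn.LeakyReLU(0.1)
        self.residual_quant = FakeQuant()  # extra Q/DQ on the identity branch
    def forward(self, x):
        y = self.act(self.bn(self.conv(x)))
        # In the real toolkit the main branch is already quantized by its
        # QuantConv2d input quantizer; quantizing the identity branch too
        # lets the add run fused in int8.
        return y + self.residual_quant(x)

out = ResidualBlock().eval()(torch.randn(1, 8, 16, 16))
```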
In addition to @spolisetty’s observations about LeakyReLU, we also had to place additional Q/DQ nodes around SiLU layers to force them to run in int8. As we understand it, and as mentioned in the docs, they are not fused and quantised by default because “It’s sometimes useful to preserve the higher-precision dequantized output. For example, we can want to follow the linear operation by an activation function (SiLU, in the following diagram) that requires higher precision to improve network accuracy.” But in our case we did not see any drops in performance with QAT training.
Hope that helps.
Is there a way to easily add additional Q and DQ layers between BN and LeakyReLU in a PyTorch model?
Have a look at Quantizing Resnet50 — pytorch-quantization master documentation. There is an example of how a residual quantizer is inserted into the Bottleneck layers. You need to 1) initialize an additional quantizer in the layers that contain BN and LeakyReLU, and 2) change the forward call so the input goes from BN to that quantizer first and then to LeakyReLU.
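Those two steps can be sketched as follows. This is only an illustration: FakeQuant is a runnable stand-in for the toolkit's quantizer (in real code you would use quant_nn.TensorQuantizer with a quant descriptor, e.g. quant_nn.QuantConv2d.default_quant_desc_input, and calibrate it).

```python
# Sketch: 1) add a quantizer to the block, 2) route the forward pass
# through it between BN and LeakyReLU.
import torch
import torch.nn as nn

class FakeQuant(nn.Module):
    """Toy symmetric int8 fake-quantizer with a fixed range (stand-in for
    pytorch-quantization's TensorQuantizer; no calibration)."""
    def __init__(self, amax=4.0):
        super().__init__()
        self.scale = amax / 127.0
    def forward(self, x):
        return torch.clamp(torch.round(x / self.scale), -127, 127) * self.scale

class ConvBNLeaky(nn.Module):
    def __init__(self, cin=3, cout=8):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, padding=1)
        self.bn = nn.BatchNorm2d(cout)
        self.bn_quant = FakeQuant()  # 1) the additional quantizer
        self.act = nn.LeakyReLU(0.1)
    def forward(self, x):
        x = self.bn(self.conv(x))
        x = self.bn_quant(x)         # 2) BN -> quantizer -> LeakyReLU
        return self.act(x)

out = ConvBNLeaky().eval()(torch.randn(1, 3, 16, 16))
```

When this model is exported to ONNX with fake-quant enabled, the quantizer becomes an explicit Q/DQ pair between the BN and LeakyReLU nodes, which is what allows TensorRT to keep that part of the graph in int8.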