Post quantization aware training is slower than fp16 and post quantization

OnePieceOfDeepLearning · September 24, 2021, 10:27am

Hi there,

I tried to benchmark int8 and fp16 for mobilenet0.25+ssd on a Jetson NX with JetPack 4.6.

For post training, I use the pytorch-quantization toolkit (TensorRT/tools/pytorch-quantization at master · NVIDIA/TensorRT · GitHub) and generate the calibrated ONNX.

But I found that int8 is much slower than fp16.

With trtexec, fp16 reaches 346.861 qps and int8 reaches 217.914 qps (roughly 37% lower throughput).

Here is the model with quantization/dequantization nodes: epoch_15.onnx (1.7 MB),
and here is the model without quantization/dequantization nodes: epoch_250.onnx (1.6 MB)

and here is the trtexec log from int8

int8.txt (28.7 KB)

and here is fp16

fp16.txt (30.0 KB)

any idea?

NVES · September 24, 2021, 10:37am

Hi, the UFF and Caffe parsers have been deprecated from TensorRT 7 onwards, so we request that you try the ONNX parser.
Please check the below link for the same.

Thanks!

OnePieceOfDeepLearning · September 24, 2021, 3:10pm

I did not use the UFF or Caffe parser; I am already using the ONNX parser. Please look into my question. Thank you. This issue is very close to mine (inference of QAT int8 model did not accelerate · Issue #1423 · NVIDIA/TensorRT · GitHub).

OnePieceOfDeepLearning · September 27, 2021, 2:13am

I found that QAT int8 is even slower than PTQ: ptq_int8.txt (32.5 KB)

spolisetty · September 27, 2021, 2:49am

Hi,

It looks like you’re using the Jetson platform. INT8 may not be supported on Jetson hardware; please check the precision support matrix here.

Please allow us some time to test it on V100. In the meantime, we recommend trying mixed precision and fp16.

Thank you.

OnePieceOfDeepLearning · September 27, 2021, 3:46am

INT8 is supported on the Jetson NX. We got very good speed with PTQ.

spolisetty · September 30, 2021, 9:34am

Hi,

We could reproduce similar behaviour. Please allow us some time to work on this.

Thank you.

krisz · December 29, 2021, 10:26am

In fact, we cannot guarantee that int8 performance is better than fp16 performance. However, we will still have a look.

spolisetty · January 25, 2022, 9:10am

Hi,

After working on this, our team identified that QAT int8 inference is slower than fp16 inference because the model is running in mixed precision. To run the whole network in int8 precision, additional Q and DQ layers need to be inserted between the BN and LeakyReLU layers.

Please find modified.onnx.
modified.onnx (1.7 MB)

Thank you.

se.zyryanov · February 17, 2022, 12:48am

Hi,

Just wanted to share some of our observations. Hope that helps fellow developers and saves some headaches.

First of all, it is recommended to read and re-read the Explicit-Quantization part of the TensorRT docs, especially the Q/DQ Layer-Placement Recommendations section. Most of the behaviour we were trying to get our heads around is actually explained there.

For example, we might need to place Q/DQ nodes before an element-wise addition so that it is fused and quantised in residual layers (as in ResNets or CSPDarknets). So build your engine with the --verbose flag, check the logs (the Engine Layer Information part is the one we found most useful), and try experimenting with additional Q/DQ nodes.
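As a rough sketch of that build-and-inspect step, the snippet below assembles a trtexec command line from Python and can optionally write the verbose output to a file. The helper names (build_trtexec_cmd, run_and_save_log) and the log file name are made up for illustration; only the trtexec flags themselves (--onnx, --int8, --fp16, --verbose) are real, and whether you want --fp16 enabled as a fallback precision depends on your model.

```python
import shlex
import subprocess
from typing import List


def build_trtexec_cmd(onnx_path: str,
                      allow_fp16: bool = True,
                      log_verbose: bool = True) -> List[str]:
    """Assemble a trtexec command for an ONNX model that contains Q/DQ nodes.

    Hypothetical helper: it only builds the argument list, it does not run it.
    """
    cmd = ["trtexec", f"--onnx={onnx_path}", "--int8"]
    if allow_fp16:
        # Lets TensorRT fall back to fp16 for layers that are not quantised.
        cmd.append("--fp16")
    if log_verbose:
        # Verbose logging is what prints the per-layer engine information.
        cmd.append("--verbose")
    return cmd


def run_and_save_log(cmd: List[str], log_path: str) -> int:
    """Run the command and store stdout and stderr in log_path."""
    with open(log_path, "w") as log_file:
        result = subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    return result.returncode


if __name__ == "__main__":
    cmd = build_trtexec_cmd("epoch_15.onnx")
    # Print the command; call run_and_save_log(cmd, "int8_verbose.txt")
    # on a machine where trtexec is installed.
    print(shlex.join(cmd))
```

The saved log can then be searched for the engine layer information to see which layers ended up in int8 and which stayed in a higher precision.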

In addition to @spolisetty’s observations with LeakyReLU, we also had to place additional Q/DQ nodes between BN and SiLU layers to force them to run in int8. As we understand it, and as mentioned in the docs, they are not fused and quantised by default because “It’s sometimes useful to preserve the higher-precision dequantized output. For example, we can want to follow the linear operation by an activation function (SiLU, in the following diagram) that requires higher precision to improve network accuracy.” But in our case we did not have any drops in performance with QAT training.

Hope that helps.

kylelll · May 16, 2022, 2:55am

Hi,

Is there an easy way to add additional Q and DQ layers between BN and LeakyReLU for a PyTorch model?

se.zyryanov · May 16, 2022, 3:23am

Have a look at Quantizing Resnet50 — pytorch-quantization master documentation. There is an example of how a residual quantizer is inserted in the BasicBlock and Bottleneck layers. You need to 1) initialize an additional quantizer in those layers that have BN and LeakyReLU, and 2) change the forward call so that the output of BN goes to that quantizer first and then to LeakyReLU.
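A minimal sketch of those two steps, under some assumptions: the block name ConvBnLeaky, the attribute act_quantizer and the helper make_act_quantizer are hypothetical, and the quantizer is built the same way as the residual quantizer in the Resnet50 tutorial. The convolution is a plain nn.Conv2d to keep the example short; in a real QAT model it would be a quantized one (for example quant_nn.QuantConv2d). If pytorch-quantization is not installed, the sketch falls back to nn.Identity, which does no quantization and is only there so the wiring can be inspected.

```python
import torch
import torch.nn as nn

try:
    # NVIDIA pytorch-quantization toolkit, as used in this thread.
    from pytorch_quantization import nn as quant_nn

    def make_act_quantizer() -> nn.Module:
        # Same pattern as the residual quantizer in the Resnet50 tutorial.
        return quant_nn.TensorQuantizer(
            quant_nn.QuantConv2d.default_quant_desc_input)
except ImportError:
    def make_act_quantizer() -> nn.Module:
        # Fallback for inspection only: nn.Identity performs NO quantization.
        return nn.Identity()


class ConvBnLeaky(nn.Module):
    """Hypothetical Conv -> BN -> quantizer -> LeakyReLU block."""

    def __init__(self, in_ch: int, out_ch: int, negative_slope: float = 0.1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        # Step 1: the additional quantizer that sits between BN and LeakyReLU.
        self.act_quantizer = make_act_quantizer()
        self.act = nn.LeakyReLU(negative_slope)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.bn(self.conv(x))
        # Step 2: BN output goes through the quantizer before the activation,
        # so the exported ONNX gets a Q/DQ pair at this position.
        x = self.act_quantizer(x)
        return self.act(x)


if __name__ == "__main__":
    block = ConvBnLeaky(3, 8).eval()
    with torch.no_grad():
        out = block(torch.randn(1, 3, 16, 16))
    print(out.shape)
```

After changing the blocks this way, the model still needs to be calibrated and fine-tuned as usual before exporting to ONNX.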

Raj1234 · September 25, 2024, 9:56pm

If this issue occurs due to mixed precision (facing something similar right now with ConvNextV2, slower inference on hardware than non quantized version) what would be a solution toward quantizing layernorm?
