Quantization-aware training (INT8) is slower than FP16 and post-training quantization

Hi there,

I tried to benchmark INT8 and FP16 for mobilenet0.25+SSD on a Jetson Xavier NX with JetPack 4.6.

After training, I used the pytorch-quantization toolkit (TensorRT/tools/pytorch-quantization at master · NVIDIA/TensorRT · GitHub) to generate the calibrated ONNX.
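
For context, the calibration and export flow I follow looks roughly like the sketch below; the torchvision MobileNetV2 backbone, the 320x320 input size, and the random calibration batches are only stand-ins for my actual mobilenet0.25+SSD model and data:

import torch
import torchvision
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

# Replace torch.nn layers with quantized equivalents before building the model
quant_modules.initialize()

# Stand-in backbone; the real network is mobilenet0.25+SSD
model = torchvision.models.mobilenet_v2(num_classes=10).cuda().eval()

# Stand-in calibration batches; a DataLoader over real images goes here
calib_data = [torch.randn(8, 3, 320, 320).cuda() for _ in range(4)]

# Collect amax statistics with quantization disabled
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        if module._calibrator is not None:
            module.disable_quant()
            module.enable_calib()
        else:
            module.disable()
with torch.no_grad():
    for batch in calib_data:
        model(batch)

# Load the collected amax values and re-enable quantization
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        if module._calibrator is not None:
            module.load_calib_amax()
        module.enable()

# Export with explicit QuantizeLinear/DequantizeLinear nodes for TensorRT
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 320, 320).cuda()
torch.onnx.export(model, dummy, "epoch_15.onnx", opset_version=13)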

But I found that the INT8 engine is much slower than FP16.

With trtexec, FP16 reaches 346.861 qps, while INT8 only reaches 217.914 qps.
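
For reference, the numbers above come from trtexec runs roughly like the following; the trtexec path and workspace size are just what I use on the NX:

# FP16 engine built from the model without Q/DQ nodes
/usr/src/tensorrt/bin/trtexec --onnx=epoch_250.onnx --fp16 --workspace=2048

# INT8 engine built from the QAT model with Q/DQ nodes
/usr/src/tensorrt/bin/trtexec --onnx=epoch_15.onnx --int8 --workspace=2048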

Here is the model with quantization/dequantization nodes: epoch_15.onnx (1.7 MB), and here is the model without quantization/dequantization nodes: epoch_250.onnx (1.6 MB).

Here is the trtexec log for INT8:

int8.txt (28.7 KB)

and here is the log for FP16:

fp16.txt (30.0 KB)

Any ideas?

Hi, the UFF and Caffe parsers have been deprecated from TensorRT 7 onwards, so we request that you try the ONNX parser.
Please check the link below for the same.

Thanks!

I did not use the UFF or Caffe parser; I am using the ONNX parser. Please look into my question, thank you. This issue is very similar to mine (inference of QAT int8 model did not accelerate · Issue #1423 · NVIDIA/TensorRT · GitHub).

I also found that the QAT model is even slower than PTQ: ptq_int8.txt (32.5 KB)
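
For anyone reproducing the PTQ comparison, an implicitly quantized INT8 engine can be built from the model without Q/DQ nodes along these lines; when no calibration cache is supplied, trtexec falls back to placeholder scales, which should be fine for pure timing:

/usr/src/tensorrt/bin/trtexec --onnx=epoch_250.onnx --int8 --workspace=2048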

Hi,

It looks like you’re using a Jetson platform. INT8 may not be supported on your Jetson hardware; please check the precision support matrix here:
https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#hardware-precision-matrix

Please allow us some time to test it on a V100. In the meantime, we recommend trying mixed precision and FP16.
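
For example, building with both precision flags lets the builder choose between FP16 and INT8 kernels per layer; a rough trtexec invocation using the QAT model shared above would be:

/usr/src/tensorrt/bin/trtexec --onnx=epoch_15.onnx --int8 --fp16 --workspace=2048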

Thank you.

INT8 is supported on the Jetson Xavier NX; we got very good speed with PTQ.

Hi,

We could reproduce similar behaviour. Please allow us some time to work on this.

Thank you.