Little performance difference between int8 and fp16 on RTX2080

I have a segmentation model in ONNX format and use trtexec to build int8 and fp16 engines from it. However, the trtexec output shows almost no difference in execution time between int8 and fp16 on an RTX 2080. I expected int8 to run almost 2x faster than fp16.
I use the following commands to convert my ONNX model to fp16 and int8 TensorRT engines. More details are below. Are my conversion commands correct?

fp16:
./trtexec --onnx=fcn_hr18_512x1024_160k_cityscapes_20200602_190822-221e4a4f.onnx --fp16 --saveEngine=fcn_hr18_512x1024_160k_cityscapes_20200602_190822-221e4a4f_fp16.trt --dumpProfile --exportTimes=fcn_hr18_512x1024_160k_cityscapes_20200602_190822-221e4a4f_fp16.json

GPU compute time for fp16:
[06/17/2021-18:54:01] [I] Enqueue Time
[06/17/2021-18:54:01] [I] min: 10.1111 ms
[06/17/2021-18:54:01] [I] max: 11.0121 ms
[06/17/2021-18:54:01] [I] median: 10.2297 ms
[06/17/2021-18:54:01] [I] GPU Compute
[06/17/2021-18:54:01] [I] min: 9.04541 ms
[06/17/2021-18:54:01] [I] max: 9.98022 ms
[06/17/2021-18:54:01] [I] mean: 9.19428 ms
[06/17/2021-18:54:01] [I] median: 9.17056 ms
[06/17/2021-18:54:01] [I] percentile: 9.88184 ms at 99%

int8:

./trtexec --onnx=fcn_hr18_512x1024_160k_cityscapes_20200602_190822-221e4a4f.onnx --int8 --saveEngine=fcn_hr18_512x1024_160k_cityscapes_20200602_190822-221e4a4f_int8.trt --dumpProfile --exportTimes=fcn_hr18_512x1024_160k_cityscapes_20200602_190822-221e4a4f_int8.json

GPU compute time for int8:

[06/17/2021-19:11:44] [I] Enqueue Time
[06/17/2021-19:11:44] [I] min: 9.42017 ms
[06/17/2021-19:11:44] [I] max: 10.7423 ms
[06/17/2021-19:11:44] [I] median: 9.50635 ms
[06/17/2021-19:11:44] [I] GPU Compute
[06/17/2021-19:11:44] [I] min: 8.3761 ms
[06/17/2021-19:11:44] [I] max: 9.70874 ms
[06/17/2021-19:11:44] [I] mean: 8.55703 ms
[06/17/2021-19:11:44] [I] median: 8.45605 ms
[06/17/2021-19:11:44] [I] percentile: 9.62769 ms at 99%
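
To put a number on the gap, the mean GPU Compute times reported above (9.19428 ms for fp16, 8.55703 ms for int8) can be compared directly. A minimal sketch, assuming awk is available; the variable names are only illustrative:

```shell
# Mean GPU Compute times copied from the two trtexec logs above (ms).
fp16_mean_ms=9.19428
int8_mean_ms=8.55703

# Speedup of int8 relative to fp16 = fp16 time / int8 time.
speedup=$(awk -v fp16_ms="$fp16_mean_ms" -v int8_ms="$int8_mean_ms" \
  'BEGIN { printf "%.2f", fp16_ms / int8_ms }')

echo "int8 speedup over fp16: ${speedup}x"
# prints "int8 speedup over fp16: 1.07x"
```

That is roughly a 7% improvement, well short of the 2x I expected.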

Thanks,

Hi, please refer to the links below to perform inference in INT8.

Thanks!

Hi @fcj,

Hope this will help you.

Could you please share the complete verbose logs for both runs and, if possible, the ONNX model, so that we can try it on our end and assist you better?

Thank you.

@spolisetty

You can download the ONNX model below:

https://drive.google.com/file/d/1s97AcGU1TVpSA9JkMPjGqhQ1kUDQL8u5/view?usp=sharing

I used the following command to build the int8 engine:
./trtexec --onnx=fcn_hr18_512x1024_160k_cityscapes_20200602_190822-221e4a4f.onnx --int8 --saveEngine=fcn_hr18_512x1024_160k_cityscapes_20200602_190822-221e4a4f_int8.trt --dumpProfile --exportTimes=fcn_hr18_512x1024_160k_cityscapes_20200602_190822-221e4a4f_int8.json

The log can be downloaded below:
https://drive.google.com/file/d/1PmwsaIISSVeOt2DnPUJaxieEv4CUZ-fk/view?usp=sharing

@fcj,

When we tried it on our end, we also observed little difference.
This can happen if many layers end up falling back to fp32.
In that case, you would probably want to enable both int8 and fp16.
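
As a rough sketch of that suggestion, both precision flags can be passed to trtexec in the same build, which allows the builder to pick int8 or fp16 kernels per layer rather than only int8 or fp32. This assumes the same ONNX file and trtexec location as in the commands earlier in the thread; the `_int8_fp16` engine name and the shell variables are only illustrative.

```shell
# Assumption: same model file and trtexec binary location as used earlier in this thread.
MODEL=fcn_hr18_512x1024_160k_cityscapes_20200602_190822-221e4a4f
TRTEXEC=./trtexec
# Hypothetical output name for the mixed-precision engine.
ENGINE="${MODEL}_int8_fp16.trt"

if [ -x "$TRTEXEC" ]; then
  # --int8 and --fp16 together enable both reduced precisions for the build;
  # --dumpProfile prints per-layer timings, --verbose gives the detailed build log.
  "$TRTEXEC" --onnx="${MODEL}.onnx" --int8 --fp16 \
    --saveEngine="$ENGINE" \
    --dumpProfile --verbose
else
  echo "trtexec not found at $TRTEXEC" >&2
fi
```

Comparing the per-layer profile of this build against the int8-only build should indicate which layers were previously running in fp32.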

We also recommend that you try the latest TensorRT 8.0 GA version.

Thank you.