INT8 (8-bit inference, post-training quantization) on Windows 10 is much slower than Ubuntu 20.04

Description

Running the same ONNX model (attached model.onnx), with the same INT8 calibration cache (attached model.calib), on the same hardware, with the same TensorRT + CUDA + CUDNN versions, yields vastly different inference latency on Windows 10 versus Ubuntu 20.04. See the attached trtexec verbose logs collected on each OS.

Comparing the layer profiles between the two OSes (ubuntu_exportProfile.json vs. windows_exportProfile.json), it looks like Windows has much higher latency for every layer, sometimes up to 8x-10x slower. Is this expected?
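For reference, the per-layer comparison above can be reproduced along these lines. This is a minimal sketch: it assumes the --exportProfile output is a JSON array whose entries carry "name" and "averageMs" fields (with a leading count record that has neither), which may differ between TensorRT versions.

```python
import json

def load_profile(path):
    """Load a trtexec --exportProfile JSON file into {layer name: average ms}.

    Assumes a JSON array of records with "name" and "averageMs" fields;
    records without both (e.g. a leading count record) are skipped.
    """
    with open(path) as f:
        entries = json.load(f)
    return {e["name"]: e["averageMs"] for e in entries
            if "name" in e and "averageMs" in e}

def compare_profiles(linux_path, windows_path):
    """Return (layer, linux ms, windows ms, slowdown) tuples for layers
    present in both profiles, sorted by slowdown, largest first."""
    linux = load_profile(linux_path)
    windows = load_profile(windows_path)
    rows = [(name, linux[name], windows[name], windows[name] / linux[name])
            for name in linux if name in windows and linux[name] > 0]
    return sorted(rows, key=lambda r: r[3], reverse=True)
```

Running `compare_profiles("ubuntu_exportProfile.json", "windows_exportProfile.json")` and inspecting the top rows is how the 8x-10x figure above was estimated.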

Looking further, I see that layer Conv_324 is chosen to run with FP32 input and output on Windows for some reason (windows_exportLayerInfo.json:3999), even though the timing results show that INT8 input gives the fastest runtime (seen at windows_trtexec.log:77957). Why is this?

Environment

TensorRT Version: 8.2.2.1
GPU Type: GeForce RTX 3090
Nvidia Driver Version: 516.94
CUDA Version: 11.6
CUDNN Version: 8.5.0
Operating System + Version: Windows 10, Version: 21H2, Build: 19044.2006
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

model.calib (7.2 KB)
model.onnx (14.3 MB)
ubuntu_exportLayerInfo.json (163.6 KB)
ubuntu_exportProfile.json (24.6 KB)
ubuntu_exportTimes.json (316.0 KB)
ubuntu_trtexec.log (8.6 MB)
windows_exportLayerInfo.json (175.8 KB)
windows_exportProfile.json (26.9 KB)
windows_exportTimes.json (79.4 KB)
windows_trtexec.log (17.4 MB)

Steps To Reproduce

Run trtexec with the command found in ubuntu_trtexec.log on Ubuntu 20.04, and in windows_trtexec.log for Windows 10.
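The exact commands are recorded in the attached logs; they were of this general shape (an illustrative sketch, not the verbatim command — file names match the attachments, but other flag choices here are assumptions):

```shell
trtexec --onnx=model.onnx \
        --int8 \
        --calib=model.calib \
        --verbose \
        --separateProfileRun \
        --exportProfile=windows_exportProfile.json \
        --exportTimes=windows_exportTimes.json \
        --exportLayerInfo=windows_exportLayerInfo.json \
        > windows_trtexec.log 2>&1
```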

Hi, please refer to the links below to perform inference in INT8.

Thanks!

Hi, yes, I’ve looked at those two links before, but they do not answer the questions in this topic.

Hi,

Could you please check whether any other application is using the GPU during the engine build?
If we compare the Linux and Windows kernel times, we see that the Windows timings are not stable.

The kernel in the red box is especially large: it is abnormal for a single kernel to take 0.24 ms on Windows when it takes only 0.01 ms on Linux.
So we suspect that another application was using the GPU at that timestamp.
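One way to quantify the "not stable" observation is to look at the spread of the recorded iteration latencies. A minimal sketch, assuming the --exportTimes output is a JSON array of per-iteration records carrying a "latencyMs" field (field names may vary by TensorRT version):

```python
import json
import statistics

def latency_stability(times_path):
    """Compute (mean latency ms, coefficient of variation) from a trtexec
    --exportTimes JSON file.

    Assumes an array of per-iteration records with a "latencyMs" field;
    records without it are skipped. A high coefficient of variation
    suggests interference, e.g. another process contending for the GPU.
    """
    with open(times_path) as f:
        records = json.load(f)
    latencies = [r["latencyMs"] for r in records if "latencyMs" in r]
    mean = statistics.mean(latencies)
    cv = statistics.stdev(latencies) / mean
    return mean, cv
```

Comparing `latency_stability("ubuntu_exportTimes.json")` against `latency_stability("windows_exportTimes.json")` would show whether the Windows run is merely noisier or uniformly slower.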

Thank you.

Hi, thanks for the reply. I was careful not to run any other applications while this was happening, and I also reran trtexec several times on Windows; all runs reported similar timings. If you look closer at the per-layer profiling, every layer takes significantly longer on Windows than on Linux, so it isn’t just this one kernel.

That said, I upgraded to the latest TensorRT version and the timings are now much more comparable between the two OSes, so my issue is resolved for now.
