INT8 (8-bit inference, post-training quantization) on Windows 10 is much slower than Ubuntu 20.04

Description

Running the same ONNX model (attached model.onnx), with the same INT8 calibration cache (attached model.calib), on the same hardware, with the same TensorRT + CUDA + CUDNN versions, yields vastly different inference latency on Windows 10 versus Ubuntu 20.04. See the attached trtexec verbose logs collected on each OS.

Comparing the layer profiles between the two OSes (ubuntu_exportProfile.json vs. windows_exportProfile.json), it looks like Windows has much higher latency for every layer, sometimes up to 8x-10x slower. Is this expected?
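For reference, the per-layer comparison above can be reproduced along these lines. This is a minimal sketch: it assumes the --exportProfile output is a JSON array whose entries carry "name" and "averageMs" fields (with a leading count record that has neither), which may differ between TensorRT versions.

```python
import json

def load_profile(path):
    """Load a trtexec --exportProfile JSON file into {layer name: average ms}.

    Assumes a JSON array of records with "name" and "averageMs" fields;
    records without both (e.g. a leading count record) are skipped.
    """
    with open(path) as f:
        entries = json.load(f)
    return {e["name"]: e["averageMs"] for e in entries
            if "name" in e and "averageMs" in e}

def compare_profiles(linux_path, windows_path):
    """Return (layer, linux ms, windows ms, slowdown) tuples for layers
    present in both profiles, sorted by slowdown, largest first."""
    linux = load_profile(linux_path)
    windows = load_profile(windows_path)
    rows = [(name, linux[name], windows[name], windows[name] / linux[name])
            for name in linux if name in windows and linux[name] > 0]
    return sorted(rows, key=lambda r: r[3], reverse=True)
```

Running `compare_profiles("ubuntu_exportProfile.json", "windows_exportProfile.json")` and inspecting the top rows is how the 8x-10x figure above was estimated.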

Looking further, I see that layer Conv_324 is chosen to run with FP32 input and output on Windows for some reason (windows_exportLayerInfo.json:3999), even though the timing results show that INT8 input gives the fastest runtime (seen at windows_trtexec.log:77957). Why is this?

Environment

TensorRT Version: 8.2.2.1
GPU Type: GeForce RTX 3090
Nvidia Driver Version: 516.94
CUDA Version: 11.6
CUDNN Version: 8.5.0
Operating System + Version: Windows 10, Version: 21H2, Build: 19044.2006
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

model.calib (7.2 KB)
model.onnx (14.3 MB)
ubuntu_exportLayerInfo.json (163.6 KB)
ubuntu_exportProfile.json (24.6 KB)
ubuntu_exportTimes.json (316.0 KB)
ubuntu_trtexec.log (8.6 MB)
windows_exportLayerInfo.json (175.8 KB)
windows_exportProfile.json (26.9 KB)
windows_exportTimes.json (79.4 KB)
windows_trtexec.log (17.4 MB)

Steps To Reproduce

Run trtexec with the command found in ubuntu_trtexec.log on Ubuntu 20.04, and in windows_trtexec.log for Windows 10.
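The exact commands are recorded in the attached logs; they were of this general shape (an illustrative sketch, not the verbatim command — file names match the attachments, but other flag choices here are assumptions):

```shell
trtexec --onnx=model.onnx \
        --int8 \
        --calib=model.calib \
        --verbose \
        --separateProfileRun \
        --exportProfile=windows_exportProfile.json \
        --exportTimes=windows_exportTimes.json \
        --exportLayerInfo=windows_exportLayerInfo.json \
        > windows_trtexec.log 2>&1
```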

Hi, please refer to the links below to perform inference in INT8.

Thanks!

Hi, yes, I’ve looked at those two links before, but they do not answer the questions in this topic.

Hi,

Could you please check whether any other application is using the GPU during the engine build?
If we compare the Linux and Windows kernel times, we see that the Windows timings are not stable.

The kernel in the red box is especially large: it is abnormal for a single kernel to take 0.24 ms on Windows when it takes only 0.01 ms on Linux.
So we suspect that another application was using the GPU at that timestamp.
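One way to quantify the "not stable" observation is to look at the spread of the recorded iteration latencies. A minimal sketch, assuming the --exportTimes output is a JSON array of per-iteration records carrying a "latencyMs" field (field names may vary by TensorRT version):

```python
import json
import statistics

def latency_stability(times_path):
    """Compute (mean latency ms, coefficient of variation) from a trtexec
    --exportTimes JSON file.

    Assumes an array of per-iteration records with a "latencyMs" field;
    records without it are skipped. A high coefficient of variation
    suggests interference, e.g. another process contending for the GPU.
    """
    with open(times_path) as f:
        records = json.load(f)
    latencies = [r["latencyMs"] for r in records if "latencyMs" in r]
    mean = statistics.mean(latencies)
    cv = statistics.stdev(latencies) / mean
    return mean, cv
```

Comparing `latency_stability("ubuntu_exportTimes.json")` against `latency_stability("windows_exportTimes.json")` would show whether the Windows run is merely noisier or uniformly slower.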

Thank you.

Hi, thanks for the reply. I was careful not to run any other applications while this was happening, and I also reran trtexec several times on Windows; all runs reported similar timings. If you look closer at the per-layer profiling, every layer takes significantly longer on Windows than on Linux, so it isn’t just this one kernel.

That said, I upgraded to the latest TensorRT version and the timings are now much more comparable between the two OSes, so my issue is resolved for now.
