Description
Running the same ONNX model (attached model.onnx), with the same INT8 calibration cache (attached model.calib), on the same hardware, with the same TensorRT + CUDA + cuDNN versions, returns vastly different inference latency on Windows 10 versus Ubuntu 20.04. Please see the attached trtexec verbose logs collected on each OS.
Comparing the layer profiles between the two OSes (ubuntu_exportProfile.json vs. windows_exportProfile.json), Windows shows much higher latency for every layer, sometimes up to 8x-10x slower. Is this expected?
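For reference, here is a minimal sketch of how the per-layer comparison above was made. It assumes the usual trtexec --exportProfile layout, i.e. a JSON array whose layer entries carry "name" and "averageMs" fields; the exact schema can differ between TensorRT versions, so adjust the key names to match your dumps.

```python
import json

def load_profile(path):
    """Map layer name -> average latency (ms) from a trtexec --exportProfile dump.

    Skips entries without "name"/"averageMs" keys (e.g. a leading count record).
    """
    with open(path) as f:
        entries = json.load(f)
    return {e["name"]: e["averageMs"] for e in entries
            if isinstance(e, dict) and "name" in e and "averageMs" in e}

ubuntu = load_profile("ubuntu_exportProfile.json")
windows = load_profile("windows_exportProfile.json")

# List layers present in both profiles, worst Windows/Ubuntu slowdown first.
common = sorted(set(ubuntu) & set(windows),
                key=lambda n: windows[n] / max(ubuntu[n], 1e-9), reverse=True)
for name in common[:20]:
    ratio = windows[name] / max(ubuntu[name], 1e-9)
    print(f"{ratio:6.1f}x  win={windows[name]:8.4f} ms  "
          f"ubuntu={ubuntu[name]:8.4f} ms  {name}")
```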
Looking further, I see that layer Conv_324 is chosen to run with FP32 input & output on Windows for some reason (windows_exportLayerInfo.json:3999), even though the timing results show that INT8 input gives the fastest runtime (seen at windows_trtexec.log:77957). Why is this?
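The per-layer precision can be checked directly in the exported layer info. A minimal sketch, assuming the --exportLayerInfo dump is a JSON object with a "Layers" array whose entries carry a "Name" field (the schema varies across TensorRT versions, so the key names may need adjusting):

```python
import json

with open("windows_exportLayerInfo.json") as f:
    info = json.load(f)

# Print the full record for Conv_324, including the chosen tactic and the
# input/output formats TensorRT selected for it.
for layer in info.get("Layers", []):
    if isinstance(layer, dict) and "Conv_324" in layer.get("Name", ""):
        print(json.dumps(layer, indent=2))
```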
Environment
TensorRT Version: 8.2.2.1
GPU Type: GeForce RTX 3090
Nvidia Driver Version: 516.94
CUDA Version: 11.6
CUDNN Version: 8.5.0
Operating System + Version: Windows 10, Version: 21H2, Build: 19044.2006
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
Relevant Files
model.calib (7.2 KB)
model.onnx (14.3 MB)
ubuntu_exportLayerInfo.json (163.6 KB)
ubuntu_exportProfile.json (24.6 KB)
ubuntu_exportTimes.json (316.0 KB)
ubuntu_trtexec.log (8.6 MB)
windows_exportLayerInfo.json (175.8 KB)
windows_exportProfile.json (26.9 KB)
windows_exportTimes.json (79.4 KB)
windows_trtexec.log (17.4 MB)
Steps To Reproduce
Run trtexec with the command found in ubuntu_trtexec.log on Ubuntu 20.04, and with the command found in windows_trtexec.log on Windows 10.
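The authoritative command line is printed at the top of each attached log. A representative invocation, reconstructed from the names of the exported artifacts (the flags and output paths below are assumptions, not a copy of the exact command):

```shell
trtexec --onnx=model.onnx \
        --int8 \
        --calib=model.calib \
        --verbose \
        --exportProfile=windows_exportProfile.json \
        --exportLayerInfo=windows_exportLayerInfo.json \
        --exportTimes=windows_exportTimes.json \
        > windows_trtexec.log 2>&1
```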