Extremely long engine build times for certain models on Windows with FP16

Description

We are experiencing extremely long engine build times of 16+ minutes for certain models on Windows when FP16 is enabled. The issue does not occur when FP16 is disabled or when the GPU does not support fast FP16 (for instance on a GTX 1060), and it does not seem to occur on Linux. We’ve now tested with TensorRT 7.2, 8.0, and 8.2, and all show the same issue.
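For reference, enabling FP16 in application code amounts to setting the builder’s FP16 flag, which is what trtexec --fp16 does. A minimal sketch, assuming the standard TensorRT 8.x Python API:

import tensorrt as trt

# Minimal FP16 build, mirroring trtexec --onnx=... --workspace=3000 --fp16.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("yolov5-zeroed.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.max_workspace_size = 3000 << 20  # 3000 MiB, matching --workspace=3000
config.set_flag(trt.BuilderFlag.FP16)   # omitting this flag avoids the slow build
engine_bytes = builder.build_serialized_network(network, config)

See this trtexec output from the same build: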

[02/24/2022-15:24:43] [I] === Build Options ===
[02/24/2022-15:24:43] [I] Max batch: explicit batch
[02/24/2022-15:24:43] [I] Workspace: 3000 MiB
[02/24/2022-15:24:43] [I] minTiming: 1
[02/24/2022-15:24:43] [I] avgTiming: 8
[02/24/2022-15:24:43] [I] Precision: FP32+FP16
[02/24/2022-15:24:43] [I] Calibration:
[02/24/2022-15:24:43] [I] Refit: Disabled
[02/24/2022-15:24:43] [I] Sparsity: Disabled
[02/24/2022-15:24:43] [I] Safe mode: Disabled
[02/24/2022-15:24:43] [I] DirectIO mode: Disabled
[02/24/2022-15:24:43] [I] Restricted mode: Disabled
[02/24/2022-15:24:43] [I] Save engine:
[02/24/2022-15:24:43] [I] Load engine:
[02/24/2022-15:24:43] [I] Profiling verbosity: 0
[02/24/2022-15:24:43] [I] Tactic sources: Using default tactic sources
[02/24/2022-15:24:43] [I] timingCacheMode: local
[02/24/2022-15:24:43] [I] timingCacheFile:
[02/24/2022-15:24:43] [I] Input(s)s format: fp32:CHW
[02/24/2022-15:24:43] [I] Output(s)s format: fp32:CHW
[02/24/2022-15:24:43] [I] Input build shapes: model
[02/24/2022-15:24:43] [I] Input calibration shapes: model
[02/24/2022-15:24:43] [I] === System Options ===
[02/24/2022-15:24:43] [I] Device: 0
[02/24/2022-15:24:43] [I] DLACore:
[02/24/2022-15:24:43] [I] Plugins:
[02/24/2022-15:24:43] [I] === Inference Options ===
[02/24/2022-15:24:43] [I] Batch: Explicit
[02/24/2022-15:24:43] [I] Input inference shapes: model
[02/24/2022-15:24:43] [I] Iterations: 10
[02/24/2022-15:24:43] [I] Duration: 3s (+ 200ms warm up)
[02/24/2022-15:24:43] [I] Sleep time: 0ms
[02/24/2022-15:24:43] [I] Idle time: 0ms
[02/24/2022-15:24:43] [I] Streams: 1
[02/24/2022-15:24:43] [I] ExposeDMA: Disabled
[02/24/2022-15:24:43] [I] Data transfers: Enabled
[02/24/2022-15:24:43] [I] Spin-wait: Disabled
[02/24/2022-15:24:43] [I] Multithreading: Disabled
[02/24/2022-15:24:43] [I] CUDA Graph: Disabled
[02/24/2022-15:24:43] [I] Separate profiling: Disabled
[02/24/2022-15:24:43] [I] Time Deserialize: Disabled
[02/24/2022-15:24:43] [I] Time Refit: Disabled
[02/24/2022-15:24:43] [I] Skip inference: Disabled
[02/24/2022-15:24:43] [I] Inputs:
[02/24/2022-15:24:43] [I] === Reporting Options ===
[02/24/2022-15:24:43] [I] Verbose: Disabled
[02/24/2022-15:24:43] [I] Averages: 10 inferences
[02/24/2022-15:24:43] [I] Percentile: 99
[02/24/2022-15:24:43] [I] Dump refittable layers:Disabled
[02/24/2022-15:24:43] [I] Dump output: Disabled
[02/24/2022-15:24:43] [I] Profile: Disabled
[02/24/2022-15:24:43] [I] Export timing to JSON file:
[02/24/2022-15:24:43] [I] Export output to JSON file:
[02/24/2022-15:24:43] [I] Export profile to JSON file:
[02/24/2022-15:24:43] [I]
[02/24/2022-15:24:44] [I] === Device Information ===
[02/24/2022-15:24:44] [I] Selected Device: NVIDIA GeForce RTX 3060 Laptop GPU
[02/24/2022-15:24:44] [I] Compute Capability: 8.6
[02/24/2022-15:24:44] [I] SMs: 30
[02/24/2022-15:24:44] [I] Compute Clock Rate: 1.702 GHz
[02/24/2022-15:24:44] [I] Device Global Memory: 6143 MiB
[02/24/2022-15:24:44] [I] Shared Memory per SM: 100 KiB
[02/24/2022-15:24:44] [I] Memory Bus Width: 192 bits (ECC disabled)
[02/24/2022-15:24:44] [I] Memory Clock Rate: 7.001 GHz
[02/24/2022-15:24:44] [I]
[02/24/2022-15:24:44] [I] TensorRT version: 8.2.3
[02/24/2022-15:24:44] [I] [TRT] [MemUsageChange] Init CUDA: CPU +571, GPU +0, now: CPU 5976, GPU 1190 (MiB)
[02/24/2022-15:24:44] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 6039 MiB, GPU 1190 MiB
[02/24/2022-15:24:45] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 6209 MiB, GPU 1234 MiB
[02/24/2022-15:24:45] [I] Start parsing network model
[02/24/2022-15:24:45] [I] [TRT] ----------------------------------------------------------------
[02/24/2022-15:24:45] [I] [TRT] Input filename:   ...\yolov5-zeroed.onnx
[02/24/2022-15:24:45] [I] [TRT] ONNX IR version:  0.0.8
[02/24/2022-15:24:45] [I] [TRT] Opset version:    12
[02/24/2022-15:24:45] [I] [TRT] Producer name:
[02/24/2022-15:24:45] [I] [TRT] Producer version:
[02/24/2022-15:24:45] [I] [TRT] Domain:
[02/24/2022-15:24:45] [I] [TRT] Model version:    0
[02/24/2022-15:24:45] [I] [TRT] Doc string:
[02/24/2022-15:24:45] [I] [TRT] ----------------------------------------------------------------
[02/24/2022-15:24:45] [W] [TRT] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[02/24/2022-15:24:45] [W] [TRT] onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
[02/24/2022-15:24:45] [I] Finish parsing network model
[02/24/2022-15:24:45] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.3 but loaded cuBLAS/cuBLAS LT 11.3.0
[02/24/2022-15:24:45] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +724, GPU +266, now: CPU 6917, GPU 1500 (MiB)
[02/24/2022-15:24:46] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +462, GPU +258, now: CPU 7379, GPU 1758 (MiB)
[02/24/2022-15:24:46] [W] [TRT] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.0.4
[02/24/2022-15:24:46] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[02/24/2022-15:34:28] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[02/24/2022-15:41:15] [I] [TRT] Detected 1 inputs and 4 output network tensors.
[02/24/2022-15:41:16] [I] [TRT] Total Host Persistent Memory: 134480
[02/24/2022-15:41:16] [I] [TRT] Total Device Persistent Memory: 14074880
[02/24/2022-15:41:16] [I] [TRT] Total Scratch Memory: 102912
[02/24/2022-15:41:16] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 17 MiB, GPU 2189 MiB
[02/24/2022-15:41:16] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 11.3655ms to assign 6 blocks to 107 nodes requiring 17203200 bytes.
[02/24/2022-15:41:16] [I] [TRT] Total Activation Memory: 17203200
[02/24/2022-15:41:16] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.3 but loaded cuBLAS/cuBLAS LT 11.3.0
[02/24/2022-15:41:16] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 7444, GPU 2004 (MiB)
[02/24/2022-15:41:16] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 7445, GPU 2014 (MiB)
[02/24/2022-15:41:16] [W] [TRT] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.0.4
[02/24/2022-15:41:16] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +13, GPU +15, now: CPU 13, GPU 15 (MiB)
[02/24/2022-15:41:16] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 7457, GPU 1958 (MiB)
[02/24/2022-15:41:16] [I] [TRT] Loaded engine size: 15 MiB
[02/24/2022-15:41:16] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.3 but loaded cuBLAS/cuBLAS LT 11.3.0
[02/24/2022-15:41:16] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 7456, GPU 1986 (MiB)
[02/24/2022-15:41:16] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 7456, GPU 1994 (MiB)
[02/24/2022-15:41:16] [W] [TRT] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.0.4
[02/24/2022-15:41:16] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +14, now: CPU 0, GPU 14 (MiB)
[02/24/2022-15:41:16] [I] Engine built in 992.085 sec.
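(992.085 sec is roughly 16.5 minutes, consistent with the 16+ minute build times described above.)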

Environment

TensorRT Version: 8.2.3
GPU Type: NVIDIA GeForce RTX 3060
Nvidia Driver Version: 511.23
CUDA Version: 11.1
CUDNN Version: 8.0.4
Operating System + Version: Windows 11
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

yolov5-zeroed.onnx (27.0 MB)

Steps To Reproduce

Run the following trtexec command with the attached model:

trtexec --onnx="yolov5-zeroed.onnx" --workspace=3000 --fp16
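As a possible mitigation rather than a fix (the timing-cache option is visible in the build options above), persisting a timing cache may amortize tactic profiling across subsequent rebuilds, though the first build still pays the full cost:

trtexec --onnx="yolov5-zeroed.onnx" --workspace=3000 --fp16 --timingCacheFile=yolov5.cache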

Hi,
We recommend you check the supported features at the link below.

You can refer to the link below for the full list of supported operators.
For unsupported operators, you need to create a custom plugin to support the operation.

Thanks!

Hello, I don’t understand your response. There are no unsupported operators in this ONNX model; otherwise we’d see messages about them in the trtexec output. Could you please explain how this is relevant to the engine build time?

Is it possible to get an engineer to look at this issue? This is a big problem for us: our application includes several models, sometimes needs to run on systems with multiple GPUs, and the engine building process is uninterruptible. I don’t see why there would be such a drastic difference between building the engine on Windows and building it on Linux; it seems like there must be a bug on the TensorRT side.
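(For illustration, with hypothetical names: a minimal per-GPU engine cache using the standard TensorRT Python runtime API. Even with caching like this, the first run on each machine and GPU pays the full build cost, which is why the Windows slowdown hurts so much.)

import os
import tensorrt as trt

# Illustrative engine cache; build_fn stands in for the slow FP16 build above.
# engine_path should encode the GPU model and TensorRT version, since
# serialized engines are not portable across either.
def load_or_build_engine(engine_path, build_fn):
    logger = trt.Logger(trt.Logger.INFO)
    runtime = trt.Runtime(logger)
    if os.path.exists(engine_path):
        with open(engine_path, "rb") as f:
            return runtime.deserialize_cuda_engine(f.read())
    serialized = build_fn()  # 16+ minutes per model on affected Windows systems
    with open(engine_path, "wb") as f:
        f.write(serialized)
    return runtime.deserialize_cuda_engine(serialized)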

Hi,

We could reproduce this issue. Please allow us some time to work on this.

Thank you.

Okay great, thank you for your response.

Hi @myles.inglis,

Could you please try the latest TensorRT version 8.4 EA and let us know if you still face this issue in a Windows environment. Please share trtexec --verbose logs. On the latest version we measured an improved build time of around 95 sec (Linux env).
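For example, the repro command above with verbose logging enabled:

trtexec --onnx="yolov5-zeroed.onnx" --workspace=3000 --fp16 --verbose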
https://developer.nvidia.com/nvidia-tensorrt-8x-download

Thank you.