Description
We are experiencing extremely long engine build times of 16+ minutes for certain models on Windows when FP16 is enabled. The issue does not occur when FP16 is disabled or when the GPU lacks fast FP16 support (for instance, a GTX 1060), and it does not appear to occur on Linux. We have now tested TensorRT 7.2, 8.0, and 8.2, and all exhibit the same issue. See this trtexec output:
[02/24/2022-15:24:43] [I] === Build Options ===
[02/24/2022-15:24:43] [I] Max batch: explicit batch
[02/24/2022-15:24:43] [I] Workspace: 3000 MiB
[02/24/2022-15:24:43] [I] minTiming: 1
[02/24/2022-15:24:43] [I] avgTiming: 8
[02/24/2022-15:24:43] [I] Precision: FP32+FP16
[02/24/2022-15:24:43] [I] Calibration:
[02/24/2022-15:24:43] [I] Refit: Disabled
[02/24/2022-15:24:43] [I] Sparsity: Disabled
[02/24/2022-15:24:43] [I] Safe mode: Disabled
[02/24/2022-15:24:43] [I] DirectIO mode: Disabled
[02/24/2022-15:24:43] [I] Restricted mode: Disabled
[02/24/2022-15:24:43] [I] Save engine:
[02/24/2022-15:24:43] [I] Load engine:
[02/24/2022-15:24:43] [I] Profiling verbosity: 0
[02/24/2022-15:24:43] [I] Tactic sources: Using default tactic sources
[02/24/2022-15:24:43] [I] timingCacheMode: local
[02/24/2022-15:24:43] [I] timingCacheFile:
[02/24/2022-15:24:43] [I] Input(s)s format: fp32:CHW
[02/24/2022-15:24:43] [I] Output(s)s format: fp32:CHW
[02/24/2022-15:24:43] [I] Input build shapes: model
[02/24/2022-15:24:43] [I] Input calibration shapes: model
[02/24/2022-15:24:43] [I] === System Options ===
[02/24/2022-15:24:43] [I] Device: 0
[02/24/2022-15:24:43] [I] DLACore:
[02/24/2022-15:24:43] [I] Plugins:
[02/24/2022-15:24:43] [I] === Inference Options ===
[02/24/2022-15:24:43] [I] Batch: Explicit
[02/24/2022-15:24:43] [I] Input inference shapes: model
[02/24/2022-15:24:43] [I] Iterations: 10
[02/24/2022-15:24:43] [I] Duration: 3s (+ 200ms warm up)
[02/24/2022-15:24:43] [I] Sleep time: 0ms
[02/24/2022-15:24:43] [I] Idle time: 0ms
[02/24/2022-15:24:43] [I] Streams: 1
[02/24/2022-15:24:43] [I] ExposeDMA: Disabled
[02/24/2022-15:24:43] [I] Data transfers: Enabled
[02/24/2022-15:24:43] [I] Spin-wait: Disabled
[02/24/2022-15:24:43] [I] Multithreading: Disabled
[02/24/2022-15:24:43] [I] CUDA Graph: Disabled
[02/24/2022-15:24:43] [I] Separate profiling: Disabled
[02/24/2022-15:24:43] [I] Time Deserialize: Disabled
[02/24/2022-15:24:43] [I] Time Refit: Disabled
[02/24/2022-15:24:43] [I] Skip inference: Disabled
[02/24/2022-15:24:43] [I] Inputs:
[02/24/2022-15:24:43] [I] === Reporting Options ===
[02/24/2022-15:24:43] [I] Verbose: Disabled
[02/24/2022-15:24:43] [I] Averages: 10 inferences
[02/24/2022-15:24:43] [I] Percentile: 99
[02/24/2022-15:24:43] [I] Dump refittable layers:Disabled
[02/24/2022-15:24:43] [I] Dump output: Disabled
[02/24/2022-15:24:43] [I] Profile: Disabled
[02/24/2022-15:24:43] [I] Export timing to JSON file:
[02/24/2022-15:24:43] [I] Export output to JSON file:
[02/24/2022-15:24:43] [I] Export profile to JSON file:
[02/24/2022-15:24:43] [I]
[02/24/2022-15:24:44] [I] === Device Information ===
[02/24/2022-15:24:44] [I] Selected Device: NVIDIA GeForce RTX 3060 Laptop GPU
[02/24/2022-15:24:44] [I] Compute Capability: 8.6
[02/24/2022-15:24:44] [I] SMs: 30
[02/24/2022-15:24:44] [I] Compute Clock Rate: 1.702 GHz
[02/24/2022-15:24:44] [I] Device Global Memory: 6143 MiB
[02/24/2022-15:24:44] [I] Shared Memory per SM: 100 KiB
[02/24/2022-15:24:44] [I] Memory Bus Width: 192 bits (ECC disabled)
[02/24/2022-15:24:44] [I] Memory Clock Rate: 7.001 GHz
[02/24/2022-15:24:44] [I]
[02/24/2022-15:24:44] [I] TensorRT version: 8.2.3
[02/24/2022-15:24:44] [I] [TRT] [MemUsageChange] Init CUDA: CPU +571, GPU +0, now: CPU 5976, GPU 1190 (MiB)
[02/24/2022-15:24:44] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 6039 MiB, GPU 1190 MiB
[02/24/2022-15:24:45] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 6209 MiB, GPU 1234 MiB
[02/24/2022-15:24:45] [I] Start parsing network model
[02/24/2022-15:24:45] [I] [TRT] ----------------------------------------------------------------
[02/24/2022-15:24:45] [I] [TRT] Input filename: ...\yolov5-zeroed.onnx
[02/24/2022-15:24:45] [I] [TRT] ONNX IR version: 0.0.8
[02/24/2022-15:24:45] [I] [TRT] Opset version: 12
[02/24/2022-15:24:45] [I] [TRT] Producer name:
[02/24/2022-15:24:45] [I] [TRT] Producer version:
[02/24/2022-15:24:45] [I] [TRT] Domain:
[02/24/2022-15:24:45] [I] [TRT] Model version: 0
[02/24/2022-15:24:45] [I] [TRT] Doc string:
[02/24/2022-15:24:45] [I] [TRT] ----------------------------------------------------------------
[02/24/2022-15:24:45] [W] [TRT] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[02/24/2022-15:24:45] [W] [TRT] onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
[02/24/2022-15:24:45] [I] Finish parsing network model
[02/24/2022-15:24:45] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.3 but loaded cuBLAS/cuBLAS LT 11.3.0
[02/24/2022-15:24:45] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +724, GPU +266, now: CPU 6917, GPU 1500 (MiB)
[02/24/2022-15:24:46] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +462, GPU +258, now: CPU 7379, GPU 1758 (MiB)
[02/24/2022-15:24:46] [W] [TRT] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.0.4
[02/24/2022-15:24:46] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[02/24/2022-15:34:28] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[02/24/2022-15:41:15] [I] [TRT] Detected 1 inputs and 4 output network tensors.
[02/24/2022-15:41:16] [I] [TRT] Total Host Persistent Memory: 134480
[02/24/2022-15:41:16] [I] [TRT] Total Device Persistent Memory: 14074880
[02/24/2022-15:41:16] [I] [TRT] Total Scratch Memory: 102912
[02/24/2022-15:41:16] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 17 MiB, GPU 2189 MiB
[02/24/2022-15:41:16] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 11.3655ms to assign 6 blocks to 107 nodes requiring 17203200 bytes.
[02/24/2022-15:41:16] [I] [TRT] Total Activation Memory: 17203200
[02/24/2022-15:41:16] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.3 but loaded cuBLAS/cuBLAS LT 11.3.0
[02/24/2022-15:41:16] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 7444, GPU 2004 (MiB)
[02/24/2022-15:41:16] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 7445, GPU 2014 (MiB)
[02/24/2022-15:41:16] [W] [TRT] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.0.4
[02/24/2022-15:41:16] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +13, GPU +15, now: CPU 13, GPU 15 (MiB)
[02/24/2022-15:41:16] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 7457, GPU 1958 (MiB)
[02/24/2022-15:41:16] [I] [TRT] Loaded engine size: 15 MiB
[02/24/2022-15:41:16] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.3 but loaded cuBLAS/cuBLAS LT 11.3.0
[02/24/2022-15:41:16] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 7456, GPU 1986 (MiB)
[02/24/2022-15:41:16] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 7456, GPU 1994 (MiB)
[02/24/2022-15:41:16] [W] [TRT] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.0.4
[02/24/2022-15:41:16] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +14, now: CPU 0, GPU 14 (MiB)
[02/24/2022-15:41:16] [I] Engine built in 992.085 sec.
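The log above shows "timingCacheMode: local", so the profiling results from this 992-second build are discarded. As a partial mitigation for repeat builds only (it does not fix the slow first build, and we have not confirmed it changes the Windows behavior), trtexec can persist the timing cache to disk; the cache file name below is illustrative:

```shell
# First build: tactic timings are written to a cache file (name is illustrative)
trtexec --onnx="yolov5-zeroed.onnx" --workspace=3000 --fp16 \
    --timingCacheFile=yolov5_fp16.cache

# Subsequent builds of the same model load the cache and skip most tactic timing
trtexec --onnx="yolov5-zeroed.onnx" --workspace=3000 --fp16 \
    --timingCacheFile=yolov5_fp16.cache
```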
Environment
TensorRT Version: 8.2.3
GPU Type: NVIDIA GeForce RTX 3060
Nvidia Driver Version: 511.23
CUDA Version: 11.1
CUDNN Version: 8.0.4
Operating System + Version: Windows 11
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
Relevant Files
yolov5-zeroed.onnx (27.0 MB)
Steps To Reproduce
Run the following trtexec command with the attached model:
trtexec --onnx="yolov5-zeroed.onnx" --workspace=3000 --fp16
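As a control case matching the description above, the same command without --fp16 builds quickly on the same machine:

```shell
# Control: FP16 disabled; the long build time does not occur
trtexec --onnx="yolov5-zeroed.onnx" --workspace=3000
```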