ConvTranspose + Add Slow

Description

I recently upgraded TensorRT from 8.2.1 to 8.6.1. Inference for the same ONNX model now takes more than twice as long as before. After overriding the IProfiler class to print the time consumed by each layer, I found that the fused ConvTranspose + Add layer takes much longer in 8.6.1 than the equivalent layers did in 8.2.1.

8.6.1:
Reformatting CopyNode for Input Tensor 0 to ConvTranspose_102 + Add_103 0.006144ms
ConvTranspose_102 + Add_103 2.00294ms

8.2.1:
ConvTranspose_102 0.072704ms
Add_103 0.0256ms

Is this a bug? How can I work around it?
Thanks for any help.
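(For anyone reproducing the per-layer measurement: TensorRT's Python API exposes the same IProfiler hook mentioned above. With the real API you would subclass trt.IProfiler and assign an instance to context.profiler; the accumulation logic can be sketched standalone. LayerTimeAggregator is a hypothetical helper name, and the sample values are taken from the logs above.)

```python
# Sketch of per-layer timing aggregation, modeled on TensorRT's IProfiler
# callback: report_layer_time(layer_name, ms) is invoked once per layer
# per inference run.
from collections import defaultdict

class LayerTimeAggregator:
    """Accumulates per-layer times across runs (hypothetical helper)."""
    def __init__(self):
        self.total_ms = defaultdict(float)
        self.calls = defaultdict(int)

    # Same signature as trt.IProfiler.report_layer_time
    def report_layer_time(self, layer_name: str, ms: float) -> None:
        self.total_ms[layer_name] += ms
        self.calls[layer_name] += 1

    def summary(self):
        # Mean time per layer, slowest first
        return sorted(
            ((name, self.total_ms[name] / self.calls[name])
             for name in self.total_ms),
            key=lambda item: item[1], reverse=True)

profiler = LayerTimeAggregator()
# Values taken from the 8.6.1 profiler output above:
profiler.report_layer_time("ConvTranspose_102 + Add_103", 2.00294)
profiler.report_layer_time(
    "Reformatting CopyNode for Input Tensor 0 to ConvTranspose_102 + Add_103",
    0.006144)
for name, mean_ms in profiler.summary():
    print(f"{name}: {mean_ms:.6f} ms")
```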

Environment

TensorRT Version: 8.6.1
GPU Type: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia Driver Version: 528.79
CUDA Version: 11.8
CUDNN Version: 8.9.0
Operating System + Version: Windows 11 Home 22H2

Hi,

Could you please share the model, script, profiler code, and performance output (if not already shared) so that we can help you better?

Alternatively, you can try running your model with the trtexec command.
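For reference, an invocation of this shape builds an FP32 engine from the ONNX model and prints a per-layer timing breakdown (a sketch; flag spellings as in TensorRT 8.6, with test.onnx standing in for your model):

```shell
# --dumpProfile prints per-layer times; --separateProfiling keeps the
# profiler's overhead out of the end-to-end latency numbers.
trtexec --onnx=test.onnx \
        --saveEngine=test.trt \
        --dumpProfile \
        --separateProfiling \
        --profilingVerbosity=detailed
```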

While measuring model performance, make sure you consider the latency and throughput of the network inference itself, excluding data pre- and post-processing overhead.
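That separation can be sketched in plain Python as follows; preprocess(), infer(), and postprocess() are placeholders standing in for the real pipeline, and only the inference call sits inside the timed region:

```python
import time

def preprocess(x):   # placeholder for real pre-processing
    return x

def infer(x):        # placeholder for the actual TensorRT execution call
    return x

def postprocess(x):  # placeholder for real post-processing
    return x

def time_inference_only(sample, warmup=5, iters=50):
    """Time only the inference call, excluding pre/post-processing.

    With a real GPU pipeline you would also synchronize the CUDA stream
    before reading the clock, so queued work is actually finished.
    """
    inp = preprocess(sample)        # outside the timed region
    for _ in range(warmup):         # warm-up runs are discarded
        infer(inp)
    start = time.perf_counter()
    for _ in range(iters):
        out = infer(inp)
    elapsed = time.perf_counter() - start
    postprocess(out)                # outside the timed region
    return elapsed / iters * 1000.0  # mean latency in ms

latency_ms = time_inference_only(sample=0)
print(f"mean inference latency: {latency_ms:.4f} ms")
```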
Please refer to the below links for more details:

Thanks!

8.6.1 trtexec results:
[07/04/2023-15:52:15] [I] === Model Options ===
[07/04/2023-15:52:15] [I] Format: ONNX
[07/04/2023-15:52:15] [I] Model: test.onnx
[07/04/2023-15:52:15] [I] Output:
[07/04/2023-15:52:15] [I] === Build Options ===
[07/04/2023-15:52:15] [I] Max batch: explicit batch
[07/04/2023-15:52:15] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[07/04/2023-15:52:15] [I] minTiming: 1
[07/04/2023-15:52:15] [I] avgTiming: 8
[07/04/2023-15:52:15] [I] Precision: FP32
[07/04/2023-15:52:15] [I] LayerPrecisions:
[07/04/2023-15:52:15] [I] Layer Device Types:
[07/04/2023-15:52:15] [I] Calibration:
[07/04/2023-15:52:15] [I] Refit: Disabled
[07/04/2023-15:52:15] [I] Version Compatible: Disabled
[07/04/2023-15:52:15] [I] TensorRT runtime: full
[07/04/2023-15:52:15] [I] Lean DLL Path:
[07/04/2023-15:52:15] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[07/04/2023-15:52:15] [I] Exclude Lean Runtime: Disabled
[07/04/2023-15:52:15] [I] Sparsity: Disabled
[07/04/2023-15:52:15] [I] Safe mode: Disabled
[07/04/2023-15:52:15] [I] Build DLA standalone loadable: Disabled
[07/04/2023-15:52:15] [I] Allow GPU fallback for DLA: Disabled
[07/04/2023-15:52:15] [I] DirectIO mode: Disabled
[07/04/2023-15:52:15] [I] Restricted mode: Disabled
[07/04/2023-15:52:15] [I] Skip inference: Disabled
[07/04/2023-15:52:15] [I] Save engine: test.trt
[07/04/2023-15:52:15] [I] Load engine:
[07/04/2023-15:52:15] [I] Profiling verbosity: 0
[07/04/2023-15:52:15] [I] Tactic sources: Using default tactic sources
[07/04/2023-15:52:15] [I] timingCacheMode: local
[07/04/2023-15:52:15] [I] timingCacheFile:
[07/04/2023-15:52:15] [I] Heuristic: Disabled
[07/04/2023-15:52:15] [I] Preview Features: Use default preview flags.
[07/04/2023-15:52:15] [I] MaxAuxStreams: -1
[07/04/2023-15:52:15] [I] BuilderOptimizationLevel: -1
[07/04/2023-15:52:15] [I] Input(s)s format: fp32:CHW
[07/04/2023-15:52:15] [I] Output(s)s format: fp32:CHW
[07/04/2023-15:52:15] [I] Input build shapes: model
[07/04/2023-15:52:15] [I] Input calibration shapes: model
[07/04/2023-15:52:15] [I] === System Options ===
[07/04/2023-15:52:15] [I] Device: 0
[07/04/2023-15:52:15] [I] DLACore:
[07/04/2023-15:52:15] [I] Plugins:
[07/04/2023-15:52:15] [I] setPluginsToSerialize:
[07/04/2023-15:52:15] [I] dynamicPlugins:
[07/04/2023-15:52:15] [I] ignoreParsedPluginLibs: 0
[07/04/2023-15:52:15] [I]
[07/04/2023-15:52:15] [I] === Inference Options ===
[07/04/2023-15:52:15] [I] Batch: Explicit
[07/04/2023-15:52:15] [I] Input inference shapes: model
[07/04/2023-15:52:15] [I] Iterations: 10
[07/04/2023-15:52:15] [I] Duration: 3s (+ 200ms warm up)
[07/04/2023-15:52:15] [I] Sleep time: 0ms
[07/04/2023-15:52:15] [I] Idle time: 0ms
[07/04/2023-15:52:15] [I] Inference Streams: 1
[07/04/2023-15:52:15] [I] ExposeDMA: Disabled
[07/04/2023-15:52:15] [I] Data transfers: Enabled
[07/04/2023-15:52:15] [I] Spin-wait: Disabled
[07/04/2023-15:52:15] [I] Multithreading: Disabled
[07/04/2023-15:52:15] [I] CUDA Graph: Disabled
[07/04/2023-15:52:15] [I] Separate profiling: Disabled
[07/04/2023-15:52:15] [I] Time Deserialize: Disabled
[07/04/2023-15:52:15] [I] Time Refit: Disabled
[07/04/2023-15:52:15] [I] NVTX verbosity: 0
[07/04/2023-15:52:15] [I] Persistent Cache Ratio: 0
[07/04/2023-15:52:15] [I] Inputs:
[07/04/2023-15:52:15] [I] === Reporting Options ===
[07/04/2023-15:52:15] [I] Verbose: Disabled
[07/04/2023-15:52:15] [I] Averages: 10 inferences
[07/04/2023-15:52:15] [I] Percentiles: 90,95,99
[07/04/2023-15:52:15] [I] Dump refittable layers:Disabled
[07/04/2023-15:52:15] [I] Dump output: Disabled
[07/04/2023-15:52:15] [I] Profile: Disabled
[07/04/2023-15:52:15] [I] Export timing to JSON file:
[07/04/2023-15:52:15] [I] Export output to JSON file:
[07/04/2023-15:52:15] [I] Export profile to JSON file:
[07/04/2023-15:52:15] [I]
[07/04/2023-15:52:15] [I] === Device Information ===
[07/04/2023-15:52:15] [I] Selected Device: NVIDIA GeForce RTX 3060 Laptop GPU
[07/04/2023-15:52:15] [I] Compute Capability: 8.6
[07/04/2023-15:52:15] [I] SMs: 30
[07/04/2023-15:52:15] [I] Device Global Memory: 6143 MiB
[07/04/2023-15:52:15] [I] Shared Memory per SM: 100 KiB
[07/04/2023-15:52:15] [I] Memory Bus Width: 192 bits (ECC disabled)
[07/04/2023-15:52:15] [I] Application Compute Clock Rate: 1.702 GHz
[07/04/2023-15:52:15] [I] Application Memory Clock Rate: 7.001 GHz
[07/04/2023-15:52:15] [I]
[07/04/2023-15:52:15] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[07/04/2023-15:52:15] [I]
[07/04/2023-15:52:15] [I] TensorRT version: 8.6.1
[07/04/2023-15:52:15] [I] Loading standard plugins
[07/04/2023-15:52:15] [I] [TRT] [MemUsageChange] Init CUDA: CPU +284, GPU +0, now: CPU 12241, GPU 1072 (MiB)
[07/04/2023-15:52:19] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1162, GPU +264, now: CPU 14543, GPU 1336 (MiB)
[07/04/2023-15:52:19] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See “Lazy Loading” section of CUDA documentation CUDA C++ Programming Guide
[07/04/2023-15:52:19] [I] Start parsing network model.
[07/04/2023-15:52:19] [I] [TRT] ----------------------------------------------------------------
[07/04/2023-15:52:19] [I] [TRT] Input filename: test.onnx
[07/04/2023-15:52:19] [I] [TRT] ONNX IR version: 0.0.6
[07/04/2023-15:52:19] [I] [TRT] Opset version: 11
[07/04/2023-15:52:19] [I] [TRT] Producer name: pytorch
[07/04/2023-15:52:19] [I] [TRT] Producer version: 1.9
[07/04/2023-15:52:19] [I] [TRT] Domain:
[07/04/2023-15:52:19] [I] [TRT] Model version: 0
[07/04/2023-15:52:19] [I] [TRT] Doc string:
[07/04/2023-15:52:19] [I] [TRT] ----------------------------------------------------------------
[07/04/2023-15:52:19] [I] Finished parsing network model. Parse time: 0.110906
[07/04/2023-15:52:19] [I] [TRT] Graph optimization time: 0.0062036 seconds.
[07/04/2023-15:52:19] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/04/2023-15:52:45] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[07/04/2023-15:52:45] [I] [TRT] Total Host Persistent Memory: 262480
[07/04/2023-15:52:45] [I] [TRT] Total Device Persistent Memory: 5120
[07/04/2023-15:52:45] [I] [TRT] Total Scratch Memory: 8652800
[07/04/2023-15:52:45] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 25 MiB, GPU 116 MiB
[07/04/2023-15:52:45] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 115 steps to complete.
[07/04/2023-15:52:45] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 6.9475ms to assign 10 blocks to 115 nodes requiring 45090304 bytes.
[07/04/2023-15:52:45] [I] [TRT] Total Activation Memory: 45088768
[07/04/2023-15:52:45] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +116, now: CPU 0, GPU 116 (MiB)
[07/04/2023-15:52:46] [I] Engine built in 30.9293 sec.
[07/04/2023-15:52:46] [I] [TRT] Loaded engine size: 117 MiB
[07/04/2023-15:52:46] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +115, now: CPU 0, GPU 115 (MiB)
[07/04/2023-15:52:46] [I] Engine deserialized in 0.0321676 sec.
[07/04/2023-15:52:46] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +43, now: CPU 0, GPU 158 (MiB)
[07/04/2023-15:52:46] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See “Lazy Loading” section of CUDA documentation CUDA C++ Programming Guide
[07/04/2023-15:52:46] [I] Setting persistentCacheLimit to 0 bytes.
[07/04/2023-15:52:46] [I] Using random values for input input
[07/04/2023-15:52:46] [I] Input binding for input with dimensions 1x3x512x512 is created.
[07/04/2023-15:52:46] [I] Output binding for output with dimensions 1x64x128x128 is created.
[07/04/2023-15:52:46] [I] Starting inference
[07/04/2023-15:52:49] [I] Warmup completed 6 queries over 200 ms
[07/04/2023-15:52:49] [I] Timing trace has 89 queries over 3.0655 s
[07/04/2023-15:52:49] [I]
[07/04/2023-15:52:49] [I] === Trace details ===
[07/04/2023-15:52:49] [I] Trace averages of 10 runs:
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 33.0017 ms - Host latency: 33.5733 ms (enqueue 33.4361 ms)
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 34.3227 ms - Host latency: 34.8911 ms (enqueue 34.0019 ms)
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 35.0873 ms - Host latency: 35.6554 ms (enqueue 35.6215 ms)
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 33.5733 ms - Host latency: 34.1425 ms (enqueue 34.2074 ms)
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 32.4652 ms - Host latency: 33.0368 ms (enqueue 32.9492 ms)
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 34.9549 ms - Host latency: 35.5234 ms (enqueue 35.1269 ms)
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 35.5306 ms - Host latency: 36.0989 ms (enqueue 36.6388 ms)
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 32.14 ms - Host latency: 32.709 ms (enqueue 32.3849 ms)
[07/04/2023-15:52:49] [I]
[07/04/2023-15:52:49] [I] === Performance summary ===
[07/04/2023-15:52:49] [I] Throughput: 29.0328 qps
[07/04/2023-15:52:49] [I] Latency: min = 29.62 ms, max = 40.5161 ms, mean = 34.3186 ms, median = 34.5432 ms, percentile(90%) = 37.9171 ms, percentile(95%) = 38.6229 ms, percentile(99%) = 40.5161 ms
[07/04/2023-15:52:49] [I] Enqueue Time: min = 22.5655 ms, max = 44.9667 ms, mean = 34.2031 ms, median = 33.9287 ms, percentile(90%) = 39.4993 ms, percentile(95%) = 40.7681 ms, percentile(99%) = 44.9667 ms
[07/04/2023-15:52:49] [I] H2D Latency: min = 0.245361 ms, max = 0.272644 ms, mean = 0.246794 ms, median = 0.246399 ms, percentile(90%) = 0.24707 ms, percentile(95%) = 0.247925 ms, percentile(99%) = 0.272644 ms
[07/04/2023-15:52:49] [I] GPU Compute Time: min = 29.0519 ms, max = 39.9473 ms, mean = 33.7494 ms, median = 33.9743 ms, percentile(90%) = 37.3484 ms, percentile(95%) = 38.0539 ms, percentile(99%) = 39.9473 ms
[07/04/2023-15:52:49] [I] D2H Latency: min = 0.321533 ms, max = 0.346924 ms, mean = 0.322436 ms, median = 0.321777 ms, percentile(90%) = 0.322266 ms, percentile(95%) = 0.325439 ms, percentile(99%) = 0.346924 ms
[07/04/2023-15:52:49] [I] Total Host Walltime: 3.0655 s
[07/04/2023-15:52:49] [I] Total GPU Compute Time: 3.00369 s
[07/04/2023-15:52:49] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/04/2023-15:52:49] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[07/04/2023-15:52:49] [W] * GPU compute time is unstable, with coefficient of variance = 8.9478%.
[07/04/2023-15:52:49] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/04/2023-15:52:49] [I] Explanations of the performance metrics are printed in the verbose logs.
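Following the two warnings in the summary above, the measurement could be repeated with CUDA graphs and spin-wait enabled, reusing the engine already saved as test.trt (a sketch; these are standard trtexec flags):

```shell
# CUDA graphs reduce per-iteration enqueue overhead; spin-wait trades CPU
# usage for more stable GPU timing.
trtexec --loadEngine=test.trt \
        --useCudaGraph \
        --useSpinWait
```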

8.2.1 trtexec results:
[07/04/2023-16:00:00] [I] === Model Options ===
[07/04/2023-16:00:00] [I] Format: ONNX
[07/04/2023-16:00:00] [I] Model: test.onnx
[07/04/2023-16:00:00] [I] Output:
[07/04/2023-16:00:00] [I] === Build Options ===
[07/04/2023-16:00:00] [I] Max batch: explicit batch
[07/04/2023-16:00:00] [I] Workspace: 16 MiB
[07/04/2023-16:00:00] [I] minTiming: 1
[07/04/2023-16:00:00] [I] avgTiming: 8
[07/04/2023-16:00:00] [I] Precision: FP32
[07/04/2023-16:00:00] [I] Calibration:
[07/04/2023-16:00:00] [I] Refit: Disabled
[07/04/2023-16:00:00] [I] Sparsity: Disabled
[07/04/2023-16:00:00] [I] Safe mode: Disabled
[07/04/2023-16:00:00] [I] DirectIO mode: Disabled
[07/04/2023-16:00:00] [I] Restricted mode: Disabled
[07/04/2023-16:00:00] [I] Save engine: test.trt
[07/04/2023-16:00:00] [I] Load engine:
[07/04/2023-16:00:00] [I] Profiling verbosity: 0
[07/04/2023-16:00:00] [I] Tactic sources: Using default tactic sources
[07/04/2023-16:00:00] [I] timingCacheMode: local
[07/04/2023-16:00:00] [I] timingCacheFile:
[07/04/2023-16:00:00] [I] Input(s)s format: fp32:CHW
[07/04/2023-16:00:00] [I] Output(s)s format: fp32:CHW
[07/04/2023-16:00:00] [I] Input build shapes: model
[07/04/2023-16:00:00] [I] Input calibration shapes: model
[07/04/2023-16:00:00] [I] === System Options ===
[07/04/2023-16:00:00] [I] Device: 0
[07/04/2023-16:00:00] [I] DLACore:
[07/04/2023-16:00:00] [I] Plugins:
[07/04/2023-16:00:00] [I] === Inference Options ===
[07/04/2023-16:00:00] [I] Batch: Explicit
[07/04/2023-16:00:00] [I] Input inference shapes: model
[07/04/2023-16:00:00] [I] Iterations: 10
[07/04/2023-16:00:00] [I] Duration: 3s (+ 200ms warm up)
[07/04/2023-16:00:00] [I] Sleep time: 0ms
[07/04/2023-16:00:00] [I] Idle time: 0ms
[07/04/2023-16:00:00] [I] Streams: 1
[07/04/2023-16:00:00] [I] ExposeDMA: Disabled
[07/04/2023-16:00:00] [I] Data transfers: Enabled
[07/04/2023-16:00:00] [I] Spin-wait: Disabled
[07/04/2023-16:00:00] [I] Multithreading: Disabled
[07/04/2023-16:00:00] [I] CUDA Graph: Disabled
[07/04/2023-16:00:00] [I] Separate profiling: Disabled
[07/04/2023-16:00:00] [I] Time Deserialize: Disabled
[07/04/2023-16:00:00] [I] Time Refit: Disabled
[07/04/2023-16:00:00] [I] Skip inference: Disabled
[07/04/2023-16:00:00] [I] Inputs:
[07/04/2023-16:00:00] [I] === Reporting Options ===
[07/04/2023-16:00:00] [I] Verbose: Disabled
[07/04/2023-16:00:00] [I] Averages: 10 inferences
[07/04/2023-16:00:00] [I] Percentile: 99
[07/04/2023-16:00:00] [I] Dump refittable layers:Disabled
[07/04/2023-16:00:00] [I] Dump output: Disabled
[07/04/2023-16:00:00] [I] Profile: Disabled
[07/04/2023-16:00:00] [I] Export timing to JSON file:
[07/04/2023-16:00:00] [I] Export output to JSON file:
[07/04/2023-16:00:00] [I] Export profile to JSON file:
[07/04/2023-16:00:00] [I]
[07/04/2023-16:00:00] [I] === Device Information ===
[07/04/2023-16:00:00] [I] Selected Device: NVIDIA GeForce RTX 3060 Laptop GPU
[07/04/2023-16:00:00] [I] Compute Capability: 8.6
[07/04/2023-16:00:00] [I] SMs: 30
[07/04/2023-16:00:00] [I] Compute Clock Rate: 1.702 GHz
[07/04/2023-16:00:00] [I] Device Global Memory: 6143 MiB
[07/04/2023-16:00:00] [I] Shared Memory per SM: 100 KiB
[07/04/2023-16:00:00] [I] Memory Bus Width: 192 bits (ECC disabled)
[07/04/2023-16:00:00] [I] Memory Clock Rate: 7.001 GHz
[07/04/2023-16:00:00] [I]
[07/04/2023-16:00:00] [I] TensorRT version: 8.2.1
[07/04/2023-16:00:00] [I] [TRT] [MemUsageChange] Init CUDA: CPU +277, GPU +0, now: CPU 10622, GPU 1072 (MiB)
[07/04/2023-16:00:11] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1199, GPU +264, now: CPU 12819, GPU 1336 (MiB)
[07/04/2023-16:00:11] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See “Lazy Loading” section of CUDA documentation CUDA C++ Programming Guide
[07/04/2023-16:00:11] [I] Start parsing network model
[07/04/2023-16:00:11] [I] [TRT] ----------------------------------------------------------------
[07/04/2023-16:00:11] [I] [TRT] Input filename: test.onnx
[07/04/2023-16:00:11] [I] [TRT] ONNX IR version: 0.0.6
[07/04/2023-16:00:11] [I] [TRT] Opset version: 11
[07/04/2023-16:00:11] [I] [TRT] Producer name: pytorch
[07/04/2023-16:00:11] [I] [TRT] Producer version: 1.9
[07/04/2023-16:00:11] [I] [TRT] Domain:
[07/04/2023-16:00:11] [I] [TRT] Model version: 0
[07/04/2023-16:00:11] [I] [TRT] Doc string:
[07/04/2023-16:00:11] [I] [TRT] ----------------------------------------------------------------
[07/04/2023-16:00:11] [I] Finish parsing network model
[07/04/2023-16:00:11] [I] [TRT] Graph optimization time: 0.0065841 seconds.
[07/04/2023-16:00:11] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/04/2023-16:00:33] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[07/04/2023-16:00:33] [I] [TRT] Total Host Persistent Memory: 260208
[07/04/2023-16:00:33] [I] [TRT] Total Device Persistent Memory: 236032
[07/04/2023-16:00:33] [I] [TRT] Total Scratch Memory: 2098176
[07/04/2023-16:00:33] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 25 MiB, GPU 119 MiB
[07/04/2023-16:00:33] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 110 steps to complete.
[07/04/2023-16:00:33] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 6.444ms to assign 8 blocks to 110 nodes requiring 45089280 bytes.
[07/04/2023-16:00:33] [I] [TRT] Total Activation Memory: 45088768
[07/04/2023-16:00:33] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +115, now: CPU 0, GPU 115 (MiB)
[07/04/2023-16:00:33] [I] [TRT] Loaded engine size: 117 MiB
[07/04/2023-16:00:33] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +114, now: CPU 0, GPU 114 (MiB)
[07/04/2023-16:00:33] [E] Error[3]: [runtime.cpp::nvinfer1::Runtime::~Runtime::346] Error Code 3: API Usage Error (Parameter check failed at: runtime.cpp::nvinfer1::Runtime::~Runtime::346, condition: mEngineCounter.use_count() == 1. Destroying a runtime before destroying deserialized engines created by the runtime leads to undefined behavior.
)
[07/04/2023-16:00:33] [E] Error[3]: [builder.cpp::nvinfer1::builder::Builder::~Builder::341] Error Code 3: API Usage Error (Parameter check failed at: builder.cpp::nvinfer1::builder::Builder::~Builder::341, condition: mObjectCounter.use_count() == 1. Destroying a builder object before destroying objects it created leads to undefined behavior.
)
[07/04/2023-16:00:33] [I] Engine built in 33.3251 sec.
[07/04/2023-16:00:33] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +43, now: CPU 0, GPU 157 (MiB)
[07/04/2023-16:00:33] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See “Lazy Loading” section of CUDA documentation CUDA C++ Programming Guide
[07/04/2023-16:00:33] [I] Using random values for input input
[07/04/2023-16:00:33] [I] Created input binding for input with dimensions 1x3x512x512
[07/04/2023-16:00:34] [I] Using random values for output output
[07/04/2023-16:00:34] [I] Created output binding for output with dimensions 1x64x128x128
[07/04/2023-16:00:34] [I] Starting inference
[07/04/2023-16:00:37] [I] Warmup completed 10 queries over 200 ms
[07/04/2023-16:00:37] [I] Timing trace has 148 queries over 3.03358 s
[07/04/2023-16:00:37] [I]
[07/04/2023-16:00:37] [I] === Trace details ===
[07/04/2023-16:00:37] [I] Trace averages of 10 runs:
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 21.458 ms - Host latency: 22.0232 ms (end to end 22.1509 ms, enqueue 21.1855 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 20.8364 ms - Host latency: 21.4018 ms (end to end 21.5103 ms, enqueue 20.0202 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 19.4282 ms - Host latency: 19.9976 ms (end to end 20.1049 ms, enqueue 19.8673 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 19.3706 ms - Host latency: 19.9353 ms (end to end 20.0339 ms, enqueue 18.2692 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 21.314 ms - Host latency: 21.8795 ms (end to end 22.0339 ms, enqueue 20.7183 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 19.4384 ms - Host latency: 20.0034 ms (end to end 20.101 ms, enqueue 19.0662 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 18.748 ms - Host latency: 19.3135 ms (end to end 19.4005 ms, enqueue 19.0619 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 18.4604 ms - Host latency: 19.0252 ms (end to end 19.11 ms, enqueue 16.8491 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 18.3616 ms - Host latency: 18.9261 ms (end to end 19.0176 ms, enqueue 18.4146 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 20.4157 ms - Host latency: 20.9807 ms (end to end 21.1246 ms, enqueue 19.4072 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 19.2681 ms - Host latency: 19.8331 ms (end to end 19.9717 ms, enqueue 18.7269 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 20.3905 ms - Host latency: 20.9555 ms (end to end 21.0898 ms, enqueue 20.3416 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 20.8597 ms - Host latency: 21.425 ms (end to end 21.5364 ms, enqueue 21.4828 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 19.6132 ms - Host latency: 20.1783 ms (end to end 20.2763 ms, enqueue 19.0764 ms)
[07/04/2023-16:00:37] [I]
[07/04/2023-16:00:37] [I] === Performance summary ===
[07/04/2023-16:00:37] [I] Throughput: 48.7872 qps
[07/04/2023-16:00:37] [I] Latency: min = 16.9426 ms, max = 25.7742 ms, mean = 20.3844 ms, median = 21.1941 ms, percentile(99%) = 25.6042 ms
[07/04/2023-16:00:37] [I] End-to-End Host Latency: min = 17.0652 ms, max = 25.8479 ms, mean = 20.4956 ms, median = 21.315 ms, percentile(99%) = 25.7107 ms
[07/04/2023-16:00:37] [I] Enqueue Time: min = 9.98462 ms, max = 27.6577 ms, mean = 19.4512 ms, median = 19.9514 ms, percentile(99%) = 24.9242 ms
[07/04/2023-16:00:37] [I] H2D Latency: min = 0.243652 ms, max = 0.251709 ms, mean = 0.24467 ms, median = 0.244629 ms, percentile(99%) = 0.245728 ms
[07/04/2023-16:00:37] [I] GPU Compute Time: min = 16.3779 ms, max = 25.2097 ms, mean = 19.819 ms, median = 20.6289 ms, percentile(99%) = 25.0398 ms
[07/04/2023-16:00:37] [I] D2H Latency: min = 0.320068 ms, max = 0.361877 ms, mean = 0.32066 ms, median = 0.320313 ms, percentile(99%) = 0.32251 ms
[07/04/2023-16:00:37] [I] Total Host Walltime: 3.03358 s
[07/04/2023-16:00:37] [I] Total GPU Compute Time: 2.93322 s
[07/04/2023-16:00:37] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/04/2023-16:00:37] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[07/04/2023-16:00:37] [I] Explanations of the performance metrics are printed in the verbose logs.

The onnx model:

Hi,

We have a similar known issue on file; please allow us some time to investigate.

Thank you.