- In the `trtexec --loadEngine=saved.engine` output, does `Throughput: 81.1558 qps` refer to the maximum FPS I could get using this engine? Could you elaborate on what qps means?
Here are the results for a 640x640 YOLOv7 engine at different precisions:
- fp32 - 24.1385 qps
- fp16 - 48.4616 qps
- int8 - 81.1158 qps
It all makes sense: lower precision gives the engine higher throughput. However, when I run the engine in DeepStream I do not reach these throughputs; I assume this is due to the additional overhead of the DeepStream pipeline.
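For context, here is the relative scaling I get from those numbers (my own arithmetic on the qps values above, nothing extra measured):

```python
# Relative speedup of fp16/int8 over fp32, computed from the qps values above.
fp32, fp16, int8 = 24.1385, 48.4616, 81.1158

fp16_speedup = fp16 / fp32   # roughly 2x, as expected for half precision
int8_speedup = int8 / fp32   # roughly 3.4x

print(f"fp16: {fp16_speedup:.2f}x, int8: {int8_speedup:.2f}x")
```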
- Running YOLOv4 in DeepStream gave me around 140 FPS. When I tried to inspect the engine with `trtexec`, it gave me an error:
[10/03/2023-05:24:21] [I] TensorRT version: 8.5.2
[10/03/2023-05:24:21] [I] Engine loaded in 0.226273 sec.
[10/03/2023-05:24:22] [I] [TRT] Loaded engine size: 156 MiB
[10/03/2023-05:24:23] [E] Error[1]: [pluginV2Runner.cpp::load::300] Error Code 1: Serialization (Serialization assertion creator failed.Cannot deserialize plugin since corresponding IPluginCreator not found in Plugin Registry)
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # trtexec --loadEngine=model_b1_gpu0_fp16.engine
[10/03/2023-05:24:23] [E] Error[4]: [runtime.cpp::deserializeCudaEngine::66] Error Code 4: Internal Error (Engine deserialization failed.)
[10/03/2023-05:24:23] [E] Engine deserialization failed
[10/03/2023-05:24:23] [E] Got invalid engine!
Perhaps you have any idea why this happens?
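My guess is that the engine was built with DeepStream's custom YOLO plugin, so its `IPluginCreator` is not in trtexec's plugin registry at deserialization time. Would preloading the plugin library be the right fix? Something like the following (the `.so` path here is just a placeholder from my setup, not a known-correct path):

```shell
# Preload the custom plugin library so trtexec can deserialize the engine.
# The plugin .so path below is a placeholder; it depends on the DeepStream
# YOLO integration that built the engine.
/usr/src/tensorrt/bin/trtexec \
    --loadEngine=model_b1_gpu0_fp16.engine \
    --plugins=/path/to/libnvdsinfer_custom_impl_Yolo.so
```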
- I have two devices, a GeForce GTX 1650 and a Jetson AGX Orin 64GB, and running `trtexec --loadEngine` on each gives me a similar throughput of ~44 qps. How is that possible? Isn't the Orin device much more capable according to these benchmarks?
Here is the full output of trtexec on the Orin:
trtexec --loadEngine=model_b1_gpu0_fp16.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp16.engine
[10/03/2023-05:31:05] [I] === Model Options ===
[10/03/2023-05:31:05] [I] Format: *
[10/03/2023-05:31:05] [I] Model:
[10/03/2023-05:31:05] [I] Output:
[10/03/2023-05:31:05] [I] === Build Options ===
[10/03/2023-05:31:05] [I] Max batch: 1
[10/03/2023-05:31:05] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/03/2023-05:31:05] [I] minTiming: 1
[10/03/2023-05:31:05] [I] avgTiming: 8
[10/03/2023-05:31:05] [I] Precision: FP32
[10/03/2023-05:31:05] [I] LayerPrecisions:
[10/03/2023-05:31:05] [I] Calibration:
[10/03/2023-05:31:05] [I] Refit: Disabled
[10/03/2023-05:31:05] [I] Sparsity: Disabled
[10/03/2023-05:31:05] [I] Safe mode: Disabled
[10/03/2023-05:31:05] [I] DirectIO mode: Disabled
[10/03/2023-05:31:05] [I] Restricted mode: Disabled
[10/03/2023-05:31:05] [I] Build only: Disabled
[10/03/2023-05:31:05] [I] Save engine:
[10/03/2023-05:31:05] [I] Load engine: model_b1_gpu0_fp16.engine
[10/03/2023-05:31:05] [I] Profiling verbosity: 0
[10/03/2023-05:31:05] [I] Tactic sources: Using default tactic sources
[10/03/2023-05:31:05] [I] timingCacheMode: local
[10/03/2023-05:31:05] [I] timingCacheFile:
[10/03/2023-05:31:05] [I] Heuristic: Disabled
[10/03/2023-05:31:05] [I] Preview Features: Use default preview flags.
[10/03/2023-05:31:05] [I] Input(s)s format: fp32:CHW
[10/03/2023-05:31:05] [I] Output(s)s format: fp32:CHW
[10/03/2023-05:31:05] [I] Input build shapes: model
[10/03/2023-05:31:05] [I] Input calibration shapes: model
[10/03/2023-05:31:05] [I] === System Options ===
[10/03/2023-05:31:05] [I] Device: 0
[10/03/2023-05:31:05] [I] DLACore:
[10/03/2023-05:31:05] [I] Plugins:
[10/03/2023-05:31:05] [I] === Inference Options ===
[10/03/2023-05:31:05] [I] Batch: 1
[10/03/2023-05:31:05] [I] Input inference shapes: model
[10/03/2023-05:31:05] [I] Iterations: 10
[10/03/2023-05:31:05] [I] Duration: 3s (+ 200ms warm up)
[10/03/2023-05:31:05] [I] Sleep time: 0ms
[10/03/2023-05:31:05] [I] Idle time: 0ms
[10/03/2023-05:31:05] [I] Streams: 1
[10/03/2023-05:31:05] [I] ExposeDMA: Disabled
[10/03/2023-05:31:05] [I] Data transfers: Enabled
[10/03/2023-05:31:05] [I] Spin-wait: Disabled
[10/03/2023-05:31:05] [I] Multithreading: Disabled
[10/03/2023-05:31:05] [I] CUDA Graph: Disabled
[10/03/2023-05:31:05] [I] Separate profiling: Disabled
[10/03/2023-05:31:05] [I] Time Deserialize: Disabled
[10/03/2023-05:31:05] [I] Time Refit: Disabled
[10/03/2023-05:31:05] [I] NVTX verbosity: 0
[10/03/2023-05:31:05] [I] Persistent Cache Ratio: 0
[10/03/2023-05:31:05] [I] Inputs:
[10/03/2023-05:31:05] [I] === Reporting Options ===
[10/03/2023-05:31:05] [I] Verbose: Disabled
[10/03/2023-05:31:05] [I] Averages: 10 inferences
[10/03/2023-05:31:05] [I] Percentiles: 90,95,99
[10/03/2023-05:31:05] [I] Dump refittable layers:Disabled
[10/03/2023-05:31:05] [I] Dump output: Disabled
[10/03/2023-05:31:05] [I] Profile: Disabled
[10/03/2023-05:31:05] [I] Export timing to JSON file:
[10/03/2023-05:31:05] [I] Export output to JSON file:
[10/03/2023-05:31:05] [I] Export profile to JSON file:
[10/03/2023-05:31:05] [I]
[10/03/2023-05:31:05] [I] === Device Information ===
[10/03/2023-05:31:05] [I] Selected Device: Orin
[10/03/2023-05:31:05] [I] Compute Capability: 8.7
[10/03/2023-05:31:05] [I] SMs: 8
[10/03/2023-05:31:05] [I] Compute Clock Rate: 1.3 GHz
[10/03/2023-05:31:05] [I] Device Global Memory: 62796 MiB
[10/03/2023-05:31:05] [I] Shared Memory per SM: 164 KiB
[10/03/2023-05:31:05] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/03/2023-05:31:05] [I] Memory Clock Rate: 0.612 GHz
[10/03/2023-05:31:05] [I]
[10/03/2023-05:31:05] [I] TensorRT version: 8.5.2
[10/03/2023-05:31:05] [I] Engine loaded in 0.0635924 sec.
[10/03/2023-05:31:05] [I] [TRT] Loaded engine size: 72 MiB
[10/03/2023-05:31:06] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +69, now: CPU 0, GPU 69 (MiB)
[10/03/2023-05:31:06] [I] Engine deserialized in 0.844545 sec.
[10/03/2023-05:31:06] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +58, now: CPU 0, GPU 127 (MiB)
[10/03/2023-05:31:06] [I] Setting persistentCacheLimit to 0 bytes.
[10/03/2023-05:31:06] [I] Using random values for input input
[10/03/2023-05:31:06] [I] Created input binding for input with dimensions 1x3x640x640
[10/03/2023-05:31:06] [I] Using random values for output boxes
[10/03/2023-05:31:06] [I] Created output binding for boxes with dimensions 1x25200x4
[10/03/2023-05:31:06] [I] Using random values for output scores
[10/03/2023-05:31:06] [I] Created output binding for scores with dimensions 1x25200x1
[10/03/2023-05:31:06] [I] Using random values for output classes
[10/03/2023-05:31:06] [I] Created output binding for classes with dimensions 1x25200x1
[10/03/2023-05:31:06] [I] Starting inference
[10/03/2023-05:31:09] [I] Warmup completed 6 queries over 200 ms
[10/03/2023-05:31:09] [I] Timing trace has 136 queries over 3.07206 s
[10/03/2023-05:31:09] [I]
[10/03/2023-05:31:09] [I] === Trace details ===
[10/03/2023-05:31:09] [I] Trace averages of 10 runs:
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4587 ms - Host latency: 22.8501 ms (enqueue 1.72061 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.3935 ms - Host latency: 22.7773 ms (enqueue 1.73514 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.2502 ms - Host latency: 22.6302 ms (enqueue 1.70425 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.5454 ms - Host latency: 22.9373 ms (enqueue 1.72961 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.295 ms - Host latency: 22.6798 ms (enqueue 1.73606 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4483 ms - Host latency: 22.8367 ms (enqueue 1.65728 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4314 ms - Host latency: 22.8232 ms (enqueue 1.5458 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4336 ms - Host latency: 22.823 ms (enqueue 1.61167 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4293 ms - Host latency: 22.8149 ms (enqueue 1.77466 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4159 ms - Host latency: 22.8056 ms (enqueue 1.58232 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4242 ms - Host latency: 22.813 ms (enqueue 1.59656 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4356 ms - Host latency: 22.8239 ms (enqueue 1.76089 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4284 ms - Host latency: 22.8187 ms (enqueue 1.58228 ms)
[10/03/2023-05:31:09] [I]
[10/03/2023-05:31:09] [I] === Performance summary ===
[10/03/2023-05:31:09] [I] Throughput: 44.27 qps
[10/03/2023-05:31:09] [I] Latency: min = 22.4785 ms, max = 23.4652 ms, mean = 22.8028 ms, median = 22.8085 ms, percentile(90%) = 22.8674 ms, percentile(95%) = 23.2997 ms, percentile(99%) = 23.3519 ms
[10/03/2023-05:31:09] [I] Enqueue Time: min = 1.46387 ms, max = 3.01385 ms, mean = 1.66769 ms, median = 1.60083 ms, percentile(90%) = 1.75244 ms, percentile(95%) = 2.78625 ms, percentile(99%) = 2.9436 ms
[10/03/2023-05:31:09] [I] H2D Latency: min = 0.30896 ms, max = 0.353516 ms, mean = 0.325468 ms, median = 0.324768 ms, percentile(90%) = 0.334839 ms, percentile(95%) = 0.337952 ms, percentile(99%) = 0.343323 ms
[10/03/2023-05:31:09] [I] GPU Compute Time: min = 22.1028 ms, max = 23.0645 ms, mean = 22.4151 ms, median = 22.4233 ms, percentile(90%) = 22.4722 ms, percentile(95%) = 22.9161 ms, percentile(99%) = 22.9575 ms
[10/03/2023-05:31:09] [I] D2H Latency: min = 0.0471191 ms, max = 0.0653381 ms, mean = 0.0622159 ms, median = 0.0622559 ms, percentile(90%) = 0.0634155 ms, percentile(95%) = 0.0635986 ms, percentile(99%) = 0.0648193 ms
[10/03/2023-05:31:09] [I] Total Host Walltime: 3.07206 s
[10/03/2023-05:31:09] [I] Total GPU Compute Time: 3.04846 s
[10/03/2023-05:31:09] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/03/2023-05:31:09] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp16.engine
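To make sure I am reading the summary correctly, I cross-checked the reported throughput against the trace numbers in the same log (assuming qps is simply completed queries divided by wall time):

```python
# Cross-check of the "Throughput: 44.27 qps" line against the trace numbers
# reported in the same trtexec log.
queries = 136            # "Timing trace has 136 queries"
walltime_s = 3.07206     # "Total Host Walltime: 3.07206 s"
batch_size = 1           # the engine was built with batch 1

qps = queries / walltime_s   # matches the reported 44.27 qps
fps = qps * batch_size       # with batch 1, one query == one frame
print(f"{qps:.2f} qps -> {fps:.2f} FPS")
```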
- Perhaps this is related to GPU clock boosting? Could you elaborate on that?
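For reference, these are the Jetson power/clock controls I am aware of; whether they explain the gap is exactly my question. Note that `-m 0` being the maximum-performance mode is an assumption, since mode numbers differ between Jetson models:

```shell
# Jetson power-mode and clock controls (mode numbering varies by board,
# so query the current mode first before assuming -m 0 is MAXN).
sudo nvpmodel -q          # show the current power mode
sudo nvpmodel -m 0        # switch to the maximum-performance mode
sudo jetson_clocks        # lock CPU/GPU/EMC clocks at their maximum
```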
- I get this warning when running DeepStream on the Orin device. Perhaps it is responsible for such low performance?
I would really appreciate your further support on these questions, thanks!