I got the engine format fixed, but performance didn't improve much: only from 10 FPS to 13 FPS. I still have the issue.
EDIT: And it can't run predictions at all!
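By "predict" I mean a basic inference call on the exported engine, roughly like the minimal sketch below (assuming the Ultralytics Python API; the test image path is just a placeholder, not my actual input):

    # Minimal sanity check on the same engine file benchmarked below.
    # Assumes the Ultralytics Python API; "bus.jpg" is a placeholder test image.
    from ultralytics import YOLO

    model = YOLO("./model/yolo11n.engine")
    results = model.predict("bus.jpg", imgsz=640)
    print(results[0].boxes)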
$ /usr/src/tensorrt/bin/trtexec --loadEngine=./model/yolo11n.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=./model/yolo11n.engine
[11/19/2024-17:21:38] [I] === Model Options ===
[11/19/2024-17:21:38] [I] Format: *
[11/19/2024-17:21:38] [I] Model:
[11/19/2024-17:21:38] [I] Output:
[11/19/2024-17:21:38] [I] === Build Options ===
[11/19/2024-17:21:38] [I] Max batch: 1
[11/19/2024-17:21:38] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[11/19/2024-17:21:38] [I] minTiming: 1
[11/19/2024-17:21:38] [I] avgTiming: 8
[11/19/2024-17:21:38] [I] Precision: FP32
[11/19/2024-17:21:38] [I] LayerPrecisions:
[11/19/2024-17:21:38] [I] Calibration:
[11/19/2024-17:21:38] [I] Refit: Disabled
[11/19/2024-17:21:38] [I] Sparsity: Disabled
[11/19/2024-17:21:38] [I] Safe mode: Disabled
[11/19/2024-17:21:38] [I] DirectIO mode: Disabled
[11/19/2024-17:21:38] [I] Restricted mode: Disabled
[11/19/2024-17:21:38] [I] Build only: Disabled
[11/19/2024-17:21:38] [I] Save engine:
[11/19/2024-17:21:38] [I] Load engine: ./model/yolo11n.engine
[11/19/2024-17:21:38] [I] Profiling verbosity: 0
[11/19/2024-17:21:38] [I] Tactic sources: Using default tactic sources
[11/19/2024-17:21:38] [I] timingCacheMode: local
[11/19/2024-17:21:38] [I] timingCacheFile:
[11/19/2024-17:21:38] [I] Heuristic: Disabled
[11/19/2024-17:21:38] [I] Preview Features: Use default preview flags.
[11/19/2024-17:21:38] [I] Input(s)s format: fp32:CHW
[11/19/2024-17:21:38] [I] Output(s)s format: fp32:CHW
[11/19/2024-17:21:38] [I] Input build shapes: model
[11/19/2024-17:21:38] [I] Input calibration shapes: model
[11/19/2024-17:21:38] [I] === System Options ===
[11/19/2024-17:21:38] [I] Device: 0
[11/19/2024-17:21:38] [I] DLACore:
[11/19/2024-17:21:38] [I] Plugins:
[11/19/2024-17:21:38] [I] === Inference Options ===
[11/19/2024-17:21:38] [I] Batch: 1
[11/19/2024-17:21:38] [I] Input inference shapes: model
[11/19/2024-17:21:38] [I] Iterations: 10
[11/19/2024-17:21:38] [I] Duration: 3s (+ 200ms warm up)
[11/19/2024-17:21:38] [I] Sleep time: 0ms
[11/19/2024-17:21:38] [I] Idle time: 0ms
[11/19/2024-17:21:38] [I] Streams: 1
[11/19/2024-17:21:38] [I] ExposeDMA: Disabled
[11/19/2024-17:21:38] [I] Data transfers: Enabled
[11/19/2024-17:21:38] [I] Spin-wait: Disabled
[11/19/2024-17:21:38] [I] Multithreading: Disabled
[11/19/2024-17:21:38] [I] CUDA Graph: Disabled
[11/19/2024-17:21:38] [I] Separate profiling: Disabled
[11/19/2024-17:21:38] [I] Time Deserialize: Disabled
[11/19/2024-17:21:38] [I] Time Refit: Disabled
[11/19/2024-17:21:38] [I] NVTX verbosity: 0
[11/19/2024-17:21:38] [I] Persistent Cache Ratio: 0
[11/19/2024-17:21:38] [I] Inputs:
[11/19/2024-17:21:38] [I] === Reporting Options ===
[11/19/2024-17:21:38] [I] Verbose: Disabled
[11/19/2024-17:21:38] [I] Averages: 10 inferences
[11/19/2024-17:21:38] [I] Percentiles: 90,95,99
[11/19/2024-17:21:38] [I] Dump refittable layers:Disabled
[11/19/2024-17:21:38] [I] Dump output: Disabled
[11/19/2024-17:21:38] [I] Profile: Disabled
[11/19/2024-17:21:38] [I] Export timing to JSON file:
[11/19/2024-17:21:38] [I] Export output to JSON file:
[11/19/2024-17:21:38] [I] Export profile to JSON file:
[11/19/2024-17:21:38] [I]
[11/19/2024-17:21:38] [I] === Device Information ===
[11/19/2024-17:21:38] [I] Selected Device: Orin
[11/19/2024-17:21:38] [I] Compute Capability: 8.7
[11/19/2024-17:21:38] [I] SMs: 8
[11/19/2024-17:21:38] [I] Compute Clock Rate: 0.624 GHz
[11/19/2024-17:21:38] [I] Device Global Memory: 7451 MiB
[11/19/2024-17:21:38] [I] Shared Memory per SM: 164 KiB
[11/19/2024-17:21:38] [I] Memory Bus Width: 128 bits (ECC disabled)
[11/19/2024-17:21:38] [I] Memory Clock Rate: 0.624 GHz
[11/19/2024-17:21:38] [I]
[11/19/2024-17:21:38] [I] TensorRT version: 8.5.2
[11/19/2024-17:21:38] [I] Engine loaded in 0.0135807 sec.
[11/19/2024-17:21:38] [I] [TRT] Loaded engine size: 11 MiB
[11/19/2024-17:21:40] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +616, GPU +586, now: CPU 907, GPU 3559 (MiB)
[11/19/2024-17:21:40] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +10, now: CPU 0, GPU 10 (MiB)
[11/19/2024-17:21:40] [I] Engine deserialized in 1.89538 sec.
[11/19/2024-17:21:40] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 908, GPU 3559 (MiB)
[11/19/2024-17:21:40] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +18, now: CPU 0, GPU 28 (MiB)
[11/19/2024-17:21:40] [I] Setting persistentCacheLimit to 0 bytes.
[11/19/2024-17:21:40] [I] Using random values for input images
[11/19/2024-17:21:40] [I] Created input binding for images with dimensions 1x3x640x640
[11/19/2024-17:21:40] [I] Using random values for output output0
[11/19/2024-17:21:40] [I] Created output binding for output0 with dimensions 1x84x8400
[11/19/2024-17:21:40] [I] Starting inference
[11/19/2024-17:21:43] [I] Warmup completed 1 queries over 200 ms
[11/19/2024-17:21:43] [I] Timing trace has 168 queries over 2.0829 s
[11/19/2024-17:21:43] [I]
[11/19/2024-17:21:43] [I] === Trace details ===
[11/19/2024-17:21:43] [I] Trace averages of 10 runs:
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.377 ms - Host latency: 13.1714 ms (enqueue 2.26556 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3984 ms - Host latency: 13.23 ms (enqueue 2.01779 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.4013 ms - Host latency: 13.2285 ms (enqueue 1.99799 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3937 ms - Host latency: 13.2252 ms (enqueue 1.96305 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3914 ms - Host latency: 13.2226 ms (enqueue 1.96279 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.389 ms - Host latency: 13.2236 ms (enqueue 1.96001 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3958 ms - Host latency: 13.2285 ms (enqueue 1.9571 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3916 ms - Host latency: 13.2211 ms (enqueue 1.96346 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3975 ms - Host latency: 13.2322 ms (enqueue 1.95918 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3922 ms - Host latency: 13.2249 ms (enqueue 1.96177 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3974 ms - Host latency: 13.2266 ms (enqueue 2.03279 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3936 ms - Host latency: 13.2238 ms (enqueue 1.9748 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3925 ms - Host latency: 13.2229 ms (enqueue 1.9603 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3931 ms - Host latency: 13.2239 ms (enqueue 1.96084 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3948 ms - Host latency: 13.2253 ms (enqueue 1.96047 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3944 ms - Host latency: 13.2263 ms (enqueue 1.96707 ms)
[11/19/2024-17:21:43] [I]
[11/19/2024-17:21:43] [I] === Performance summary ===
[11/19/2024-17:21:43] [I] Throughput: 80.6568 qps
[11/19/2024-17:21:43] [I] Latency: min = 12.9473 ms, max = 13.2667 ms, mean = 13.2213 ms, median = 13.2251 ms, percentile(90%) = 13.2493 ms, percentile(95%) = 13.2561 ms, percentile(99%) = 13.2632 ms
[11/19/2024-17:21:43] [I] Enqueue Time: min = 1.92603 ms, max = 3.52295 ms, mean = 1.99036 ms, median = 1.97241 ms, percentile(90%) = 2.0332 ms, percentile(95%) = 2.06165 ms, percentile(99%) = 2.61487 ms
[11/19/2024-17:21:43] [I] H2D Latency: min = 0.358154 ms, max = 0.586304 ms, mean = 0.570459 ms, median = 0.573242 ms, percentile(90%) = 0.580566 ms, percentile(95%) = 0.582764 ms, percentile(99%) = 0.585815 ms
[11/19/2024-17:21:43] [I] GPU Compute Time: min = 12.3115 ms, max = 12.4321 ms, mean = 12.3928 ms, median = 12.3928 ms, percentile(90%) = 12.4167 ms, percentile(95%) = 12.4238 ms, percentile(99%) = 12.4309 ms
[11/19/2024-17:21:43] [I] D2H Latency: min = 0.166748 ms, max = 0.264282 ms, mean = 0.258093 ms, median = 0.258606 ms, percentile(90%) = 0.261475 ms, percentile(95%) = 0.262207 ms, percentile(99%) = 0.26355 ms
[11/19/2024-17:21:43] [I] Total Host Walltime: 2.0829 s
[11/19/2024-17:21:43] [I] Total GPU Compute Time: 2.08198 s
[11/19/2024-17:21:43] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/19/2024-17:21:43] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=./model/yolo11n.engine