Batch inference does not work on the Jetson Orin NX 16GB

Hello,

I have a question about batch inference on the Jetson Orin NX 16GB. My setup is as follows.

JetPack version: 5.1.2
Model: YOLOv7 (input 544x544)
Batch sizes tested: 1, 2, 4, 8
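For reference, this is roughly how I build and benchmark each batch-size engine. This is a sketch: it assumes the ONNX model (`yolov7_544.onnx` is a placeholder filename) was exported with a dynamic batch dimension, and uses the binding name `input.1` that appears in the logs below.

```shell
# Hypothetical sketch of the build + benchmark loop (not the exact commands used).
# Assumes a dynamic-batch ONNX export; yolov7_544.onnx is an assumed filename.
for B in 1 2 4 8; do
  # Build an engine whose optimal/max profile covers batch B.
  trtexec --onnx=yolov7_544.onnx \
          --minShapes=input.1:1x3x544x544 \
          --optShapes=input.1:${B}x3x544x544 \
          --maxShapes=input.1:${B}x3x544x544 \
          --saveEngine=yolov7_544_${B}.engine

  # Benchmark it; --shapes selects the batch-B input shape at inference time.
  trtexec --loadEngine=yolov7_544_${B}.engine \
          --shapes=input.1:${B}x3x544x544 \
          --verbose --avgRuns=200 --separateProfileRun --useSpinWait
done
```

Note that without `--shapes` at benchmark time, trtexec may fall back to the profile minimum, which would make every run look like batch 1.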

<batch 1>
…/…/bin/trtexec --loadEngine=/cjs_share/yolov7_544_1.engine --verbose --avgRuns=200 --separateProfileRun --useSpinWait
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # …/…/bin/trtexec --loadEngine=/cjs_share/yolov7_544_1.engine --verbose --avgRuns=200 --separateProfileRun --useSpinWait
[01/23/2025-06:03:34] [I] === Model Options ===
[01/23/2025-06:03:34] [I] Format: *
[01/23/2025-06:03:34] [I] Model:
[01/23/2025-06:03:34] [I] Output:
[01/23/2025-06:03:34] [I] === Build Options ===
[01/23/2025-06:03:34] [I] Max batch: 1
[01/23/2025-06:03:34] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[01/23/2025-06:03:34] [I] minTiming: 1
[01/23/2025-06:03:34] [I] avgTiming: 8
[01/23/2025-06:03:34] [I] Precision: FP32
[01/23/2025-06:03:34] [I] LayerPrecisions:
[01/23/2025-06:03:34] [I] Calibration:
[01/23/2025-06:03:34] [I] Refit: Disabled
[01/23/2025-06:03:34] [I] Sparsity: Disabled
[01/23/2025-06:03:34] [I] Safe mode: Disabled
[01/23/2025-06:03:34] [I] DirectIO mode: Disabled
[01/23/2025-06:03:34] [I] Restricted mode: Disabled
[01/23/2025-06:03:34] [I] Build only: Disabled
[01/23/2025-06:03:34] [I] Save engine:
[01/23/2025-06:03:34] [I] Load engine: /cjs_share/yolov7_544_1.engine
[01/23/2025-06:03:34] [I] Profiling verbosity: 0
[01/23/2025-06:03:34] [I] Tactic sources: Using default tactic sources
[01/23/2025-06:03:34] [I] timingCacheMode: local
[01/23/2025-06:03:34] [I] timingCacheFile:
[01/23/2025-06:03:34] [I] Heuristic: Disabled
[01/23/2025-06:03:34] [I] Preview Features: Use default preview flags.
[01/23/2025-06:03:34] [I] Input(s)s format: fp32:CHW
[01/23/2025-06:03:34] [I] Output(s)s format: fp32:CHW
[01/23/2025-06:03:34] [I] Input build shapes: model
[01/23/2025-06:03:34] [I] Input calibration shapes: model
[01/23/2025-06:03:34] [I] === System Options ===
[01/23/2025-06:03:34] [I] Device: 0
[01/23/2025-06:03:34] [I] DLACore:
[01/23/2025-06:03:34] [I] Plugins:
[01/23/2025-06:03:34] [I] === Inference Options ===
[01/23/2025-06:03:34] [I] Batch: 1
[01/23/2025-06:03:34] [I] Input inference shapes: model
[01/23/2025-06:03:34] [I] Iterations: 10
[01/23/2025-06:03:34] [I] Duration: 3s (+ 200ms warm up)
[01/23/2025-06:03:34] [I] Sleep time: 0ms
[01/23/2025-06:03:34] [I] Idle time: 0ms
[01/23/2025-06:03:34] [I] Streams: 1
[01/23/2025-06:03:34] [I] ExposeDMA: Disabled
[01/23/2025-06:03:34] [I] Data transfers: Enabled
[01/23/2025-06:03:34] [I] Spin-wait: Enabled
[01/23/2025-06:03:34] [I] Multithreading: Disabled
[01/23/2025-06:03:34] [I] CUDA Graph: Disabled
[01/23/2025-06:03:34] [I] Separate profiling: Enabled
[01/23/2025-06:03:34] [I] Time Deserialize: Disabled
[01/23/2025-06:03:34] [I] Time Refit: Disabled
[01/23/2025-06:03:34] [I] NVTX verbosity: 0
[01/23/2025-06:03:34] [I] Persistent Cache Ratio: 0
[01/23/2025-06:03:34] [I] Inputs:
[01/23/2025-06:03:34] [I] === Reporting Options ===
[01/23/2025-06:03:34] [I] Verbose: Enabled
[01/23/2025-06:03:34] [I] Averages: 200 inferences
[01/23/2025-06:03:34] [I] Percentiles: 90,95,99
[01/23/2025-06:03:34] [I] Dump refittable layers:Disabled
[01/23/2025-06:03:34] [I] Dump output: Disabled
[01/23/2025-06:03:34] [I] Profile: Disabled
[01/23/2025-06:03:34] [I] Export timing to JSON file:
[01/23/2025-06:03:34] [I] Export output to JSON file:
[01/23/2025-06:03:34] [I] Export profile to JSON file:
[01/23/2025-06:03:34] [I]
[01/23/2025-06:03:34] [I] === Device Information ===
[01/23/2025-06:03:34] [I] Selected Device: Orin
[01/23/2025-06:03:34] [I] Compute Capability: 8.7
[01/23/2025-06:03:34] [I] SMs: 8
[01/23/2025-06:03:34] [I] Compute Clock Rate: 0.918 GHz
[01/23/2025-06:03:34] [I] Device Global Memory: 15523 MiB
[01/23/2025-06:03:34] [I] Shared Memory per SM: 164 KiB
[01/23/2025-06:03:34] [I] Memory Bus Width: 256 bits (ECC disabled)
[01/23/2025-06:03:34] [I] Memory Clock Rate: 0.918 GHz
[01/23/2025-06:03:34] [I]
[01/23/2025-06:03:34] [I] TensorRT version: 8.5.2
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::CoordConvAC version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::CropAndResizeDynamic version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::DecodeBbox3DPlugin version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::EfficientNMS_Explicit_TF_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::EfficientNMS_Implicit_TF_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::GroupNorm version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 2
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::LayerNorm version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::MultiscaleDeformableAttnPlugin_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::NMSDynamic_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::PillarScatterPlugin version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::ProposalDynamic version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::Proposal version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::ROIAlign_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::SeqLen2Spatial version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::SplitGeLU version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::Split version 1
[01/23/2025-06:03:34] [V] [TRT] Registered plugin creator - ::VoxelGeneratorPlugin version 1
[01/23/2025-06:03:34] [I] Engine loaded in 0.0156846 sec.
[01/23/2025-06:03:35] [I] [TRT] Loaded engine size: 24 MiB
[01/23/2025-06:03:35] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[01/23/2025-06:03:35] [V] [TRT] Deserialization required 40127 microseconds.
[01/23/2025-06:03:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +22, now: CPU 0, GPU 22 (MiB)
[01/23/2025-06:03:35] [I] Engine deserialized in 0.557384 sec.
[01/23/2025-06:03:35] [V] [TRT] Total per-runner device persistent memory is 451584
[01/23/2025-06:03:35] [V] [TRT] Total per-runner host persistent memory is 282560
[01/23/2025-06:03:35] [V] [TRT] Allocated activation device memory of size 33925120
[01/23/2025-06:03:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +33, now: CPU 0, GPU 55 (MiB)
[01/23/2025-06:03:35] [I] Setting persistentCacheLimit to 0 bytes.
[01/23/2025-06:03:35] [V] Using enqueueV3.
[01/23/2025-06:03:35] [I] Using random values for input input.1
[01/23/2025-06:03:35] [I] Created input binding for input.1 with dimensions 1x3x544x544
[01/23/2025-06:03:35] [I] Using random values for output output.det1
[01/23/2025-06:03:35] [I] Created output binding for output.det1 with dimensions 1x45x68x68
[01/23/2025-06:03:35] [I] Using random values for output output.det2
[01/23/2025-06:03:35] [I] Created output binding for output.det2 with dimensions 1x45x34x34
[01/23/2025-06:03:35] [I] Using random values for output output.det3
[01/23/2025-06:03:35] [I] Created output binding for output.det3 with dimensions 1x45x17x17
[01/23/2025-06:03:35] [I] Starting inference
[01/23/2025-06:03:38] [I] Warmup completed 21 queries over 200 ms
[01/23/2025-06:03:38] [I] Timing trace has 277 queries over 3.03035 s
[01/23/2025-06:03:38] [I]
[01/23/2025-06:03:38] [I] === Trace details ===
[01/23/2025-06:03:38] [I] Trace averages of 200 runs:
[01/23/2025-06:03:38] [I] Average on 200 runs - GPU latency: 10.7752 ms - Host latency: 11.0283 ms (enqueue 1.46886 ms)
[01/23/2025-06:03:38] [I]
[01/23/2025-06:03:38] [I] === Performance summary ===
[01/23/2025-06:03:38] [I] Throughput: 91.4087 qps
[01/23/2025-06:03:38] [I] Latency: min = 10.0884 ms, max = 12.599 ms, mean = 11.1561 ms, median = 11.2317 ms, percentile(90%) = 11.5901 ms, percentile(95%) = 11.8546 ms, percentile(99%) = 12.4006 ms
[01/23/2025-06:03:38] [I] Enqueue Time: min = 1.26912 ms, max = 2.56714 ms, mean = 1.48365 ms, median = 1.33252 ms, percentile(90%) = 1.71826 ms, percentile(95%) = 2.01587 ms, percentile(99%) = 2.22797 ms
[01/23/2025-06:03:38] [I] H2D Latency: min = 0.172363 ms, max = 0.208008 ms, mean = 0.178964 ms, median = 0.177979 ms, percentile(90%) = 0.181885 ms, percentile(95%) = 0.187439 ms, percentile(99%) = 0.201416 ms
[01/23/2025-06:03:38] [I] GPU Compute Time: min = 9.83664 ms, max = 12.3372 ms, mean = 10.9031 ms, median = 10.9792 ms, percentile(90%) = 11.3383 ms, percentile(95%) = 11.5973 ms, percentile(99%) = 12.1422 ms
[01/23/2025-06:03:38] [I] D2H Latency: min = 0.0512695 ms, max = 0.0803223 ms, mean = 0.0740428 ms, median = 0.0737915 ms, percentile(90%) = 0.0762024 ms, percentile(95%) = 0.0769653 ms, percentile(99%) = 0.079834 ms
[01/23/2025-06:03:38] [I] Total Host Walltime: 3.03035 s
[01/23/2025-06:03:38] [I] Total GPU Compute Time: 3.02016 s
[01/23/2025-06:03:38] [W] * GPU compute time is unstable, with coefficient of variance = 5.58003%.
[01/23/2025-06:03:38] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[01/23/2025-06:03:38] [I] Explanations of the performance metrics are printed in the verbose logs.
[01/23/2025-06:03:38] [V]
[01/23/2025-06:03:38] [V] === Explanations of the performance metrics ===
[01/23/2025-06:03:38] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[01/23/2025-06:03:38] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[01/23/2025-06:03:38] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[01/23/2025-06:03:38] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[01/23/2025-06:03:38] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[01/23/2025-06:03:38] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[01/23/2025-06:03:38] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[01/23/2025-06:03:38] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[01/23/2025-06:03:38] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # …/…/bin/trtexec --loadEngine=/cjs_share/yolov7_544_1.engine --verbose --avgRuns=200 --separateProfileRun --useSpinWait

<batch 8>
root@AIB-800-48B02DD8DEC1:/workspace/tensorrt/samples/sampleOnnxYolo# …/…/bin/trtexec --loadEngine=/cjs_share/yolov7_544_8.engine --verbose --avgRuns=200 --separateProfileRun --useSpinWait
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # …/…/bin/trtexec --loadEngine=/cjs_share/yolov7_544_8.engine --verbose --avgRuns=200 --separateProfileRun --useSpinWait
[01/23/2025-06:04:26] [I] === Model Options ===
[01/23/2025-06:04:26] [I] Format: *
[01/23/2025-06:04:26] [I] Model:
[01/23/2025-06:04:26] [I] Output:
[01/23/2025-06:04:26] [I] === Build Options ===
[01/23/2025-06:04:26] [I] Max batch: 1
[01/23/2025-06:04:26] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[01/23/2025-06:04:26] [I] minTiming: 1
[01/23/2025-06:04:26] [I] avgTiming: 8
[01/23/2025-06:04:26] [I] Precision: FP32
[01/23/2025-06:04:26] [I] LayerPrecisions:
[01/23/2025-06:04:26] [I] Calibration:
[01/23/2025-06:04:26] [I] Refit: Disabled
[01/23/2025-06:04:26] [I] Sparsity: Disabled
[01/23/2025-06:04:26] [I] Safe mode: Disabled
[01/23/2025-06:04:26] [I] DirectIO mode: Disabled
[01/23/2025-06:04:26] [I] Restricted mode: Disabled
[01/23/2025-06:04:26] [I] Build only: Disabled
[01/23/2025-06:04:26] [I] Save engine:
[01/23/2025-06:04:26] [I] Load engine: /cjs_share/yolov7_544_8.engine
[01/23/2025-06:04:26] [I] Profiling verbosity: 0
[01/23/2025-06:04:26] [I] Tactic sources: Using default tactic sources
[01/23/2025-06:04:26] [I] timingCacheMode: local
[01/23/2025-06:04:26] [I] timingCacheFile:
[01/23/2025-06:04:26] [I] Heuristic: Disabled
[01/23/2025-06:04:26] [I] Preview Features: Use default preview flags.
[01/23/2025-06:04:26] [I] Input(s)s format: fp32:CHW
[01/23/2025-06:04:26] [I] Output(s)s format: fp32:CHW
[01/23/2025-06:04:26] [I] Input build shapes: model
[01/23/2025-06:04:26] [I] Input calibration shapes: model
[01/23/2025-06:04:26] [I] === System Options ===
[01/23/2025-06:04:26] [I] Device: 0
[01/23/2025-06:04:26] [I] DLACore:
[01/23/2025-06:04:26] [I] Plugins:
[01/23/2025-06:04:26] [I] === Inference Options ===
[01/23/2025-06:04:26] [I] Batch: 1
[01/23/2025-06:04:26] [I] Input inference shapes: model
[01/23/2025-06:04:26] [I] Iterations: 10
[01/23/2025-06:04:26] [I] Duration: 3s (+ 200ms warm up)
[01/23/2025-06:04:26] [I] Sleep time: 0ms
[01/23/2025-06:04:26] [I] Idle time: 0ms
[01/23/2025-06:04:26] [I] Streams: 1
[01/23/2025-06:04:26] [I] ExposeDMA: Disabled
[01/23/2025-06:04:26] [I] Data transfers: Enabled
[01/23/2025-06:04:26] [I] Spin-wait: Enabled
[01/23/2025-06:04:26] [I] Multithreading: Disabled
[01/23/2025-06:04:26] [I] CUDA Graph: Disabled
[01/23/2025-06:04:26] [I] Separate profiling: Enabled
[01/23/2025-06:04:26] [I] Time Deserialize: Disabled
[01/23/2025-06:04:26] [I] Time Refit: Disabled
[01/23/2025-06:04:26] [I] NVTX verbosity: 0
[01/23/2025-06:04:26] [I] Persistent Cache Ratio: 0
[01/23/2025-06:04:26] [I] Inputs:
[01/23/2025-06:04:26] [I] === Reporting Options ===
[01/23/2025-06:04:26] [I] Verbose: Enabled
[01/23/2025-06:04:26] [I] Averages: 200 inferences
[01/23/2025-06:04:26] [I] Percentiles: 90,95,99
[01/23/2025-06:04:26] [I] Dump refittable layers:Disabled
[01/23/2025-06:04:26] [I] Dump output: Disabled
[01/23/2025-06:04:26] [I] Profile: Disabled
[01/23/2025-06:04:26] [I] Export timing to JSON file:
[01/23/2025-06:04:26] [I] Export output to JSON file:
[01/23/2025-06:04:26] [I] Export profile to JSON file:
[01/23/2025-06:04:26] [I]
[01/23/2025-06:04:26] [I] === Device Information ===
[01/23/2025-06:04:26] [I] Selected Device: Orin
[01/23/2025-06:04:26] [I] Compute Capability: 8.7
[01/23/2025-06:04:26] [I] SMs: 8
[01/23/2025-06:04:26] [I] Compute Clock Rate: 0.918 GHz
[01/23/2025-06:04:26] [I] Device Global Memory: 15523 MiB
[01/23/2025-06:04:26] [I] Shared Memory per SM: 164 KiB
[01/23/2025-06:04:26] [I] Memory Bus Width: 256 bits (ECC disabled)
[01/23/2025-06:04:26] [I] Memory Clock Rate: 0.918 GHz
[01/23/2025-06:04:26] [I]
[01/23/2025-06:04:26] [I] TensorRT version: 8.5.2
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::CoordConvAC version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::CropAndResizeDynamic version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::DecodeBbox3DPlugin version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::EfficientNMS_Explicit_TF_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::EfficientNMS_Implicit_TF_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::GroupNorm version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 2
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::LayerNorm version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::MultiscaleDeformableAttnPlugin_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::NMSDynamic_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::PillarScatterPlugin version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::ProposalDynamic version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::Proposal version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::ROIAlign_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::SeqLen2Spatial version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::SplitGeLU version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::Split version 1
[01/23/2025-06:04:26] [V] [TRT] Registered plugin creator - ::VoxelGeneratorPlugin version 1
[01/23/2025-06:04:26] [I] Engine loaded in 0.0156243 sec.
[01/23/2025-06:04:26] [I] [TRT] Loaded engine size: 23 MiB
[01/23/2025-06:04:27] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[01/23/2025-06:04:27] [V] [TRT] Deserialization required 28240 microseconds.
[01/23/2025-06:04:27] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +22, now: CPU 0, GPU 22 (MiB)
[01/23/2025-06:04:27] [I] Engine deserialized in 0.540434 sec.
[01/23/2025-06:04:27] [V] [TRT] Total per-runner device persistent memory is 444416
[01/23/2025-06:04:27] [V] [TRT] Total per-runner host persistent memory is 280576
[01/23/2025-06:04:27] [V] [TRT] Allocated activation device memory of size 271375360
[01/23/2025-06:04:27] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +259, now: CPU 0, GPU 281 (MiB)
[01/23/2025-06:04:27] [I] Setting persistentCacheLimit to 0 bytes.
[01/23/2025-06:04:27] [V] Using enqueueV3.
[01/23/2025-06:04:27] [I] Using random values for input input.1
[01/23/2025-06:04:27] [I] Created input binding for input.1 with dimensions 8x3x544x544
[01/23/2025-06:04:27] [I] Using random values for output output.det1
[01/23/2025-06:04:27] [I] Created output binding for output.det1 with dimensions 8x45x68x68
[01/23/2025-06:04:27] [I] Using random values for output output.det2
[01/23/2025-06:04:27] [I] Created output binding for output.det2 with dimensions 8x45x34x34
[01/23/2025-06:04:27] [I] Using random values for output output.det3
[01/23/2025-06:04:27] [I] Created output binding for output.det3 with dimensions 8x45x17x17
[01/23/2025-06:04:27] [I] Starting inference
[01/23/2025-06:04:31] [I] Warmup completed 3 queries over 200 ms
[01/23/2025-06:04:31] [I] Timing trace has 37 queries over 3.27673 s
[01/23/2025-06:04:31] [I]
[01/23/2025-06:04:31] [I] === Trace details ===
[01/23/2025-06:04:31] [I] Trace averages of 200 runs:
[01/23/2025-06:04:31] [I]
[01/23/2025-06:04:31] [I] === Performance summary ===
[01/23/2025-06:04:31] [I] Throughput: 11.2917 qps
[01/23/2025-06:04:31] [I] Latency: min = 79.6931 ms, max = 93.7979 ms, mean = 88.199 ms, median = 89.6377 ms, percentile(90%) = 91.3596 ms, percentile(95%) = 92.6952 ms, percentile(99%) = 93.7979 ms
[01/23/2025-06:04:31] [I] Enqueue Time: min = 2.03638 ms, max = 3.47705 ms, mean = 2.46767 ms, median = 2.3988 ms, percentile(90%) = 2.90894 ms, percentile(95%) = 3.29858 ms, percentile(99%) = 3.47705 ms
[01/23/2025-06:04:31] [I] H2D Latency: min = 1.19336 ms, max = 1.33252 ms, mean = 1.26532 ms, median = 1.27563 ms, percentile(90%) = 1.31517 ms, percentile(95%) = 1.3197 ms, percentile(99%) = 1.33252 ms
[01/23/2025-06:04:31] [I] GPU Compute Time: min = 77.9626 ms, max = 92.0701 ms, mean = 86.4584 ms, median = 87.9434 ms, percentile(90%) = 89.5012 ms, percentile(95%) = 90.9288 ms, percentile(99%) = 92.0701 ms
[01/23/2025-06:04:31] [I] D2H Latency: min = 0.357178 ms, max = 0.525879 ms, mean = 0.475301 ms, median = 0.487671 ms, percentile(90%) = 0.492737 ms, percentile(95%) = 0.524902 ms, percentile(99%) = 0.525879 ms
[01/23/2025-06:04:31] [I] Total Host Walltime: 3.27673 s
[01/23/2025-06:04:31] [I] Total GPU Compute Time: 3.19896 s
[01/23/2025-06:04:31] [W] * GPU compute time is unstable, with coefficient of variance = 4.40581%.
[01/23/2025-06:04:31] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[01/23/2025-06:04:31] [I] Explanations of the performance metrics are printed in the verbose logs.
[01/23/2025-06:04:31] [V]
[01/23/2025-06:04:31] [V] === Explanations of the performance metrics ===
[01/23/2025-06:04:31] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[01/23/2025-06:04:31] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[01/23/2025-06:04:31] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[01/23/2025-06:04:31] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[01/23/2025-06:04:31] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[01/23/2025-06:04:31] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[01/23/2025-06:04:31] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[01/23/2025-06:04:31] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[01/23/2025-06:04:31] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # …/…/bin/trtexec --loadEngine=/cjs_share/yolov7_544_8.engine --verbose --avgRuns=200 --separateProfileRun --useSpinWait
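The derived metrics in the summary above can be cross-checked from the raw numbers in the log (a small sanity-check script; the constants are copied directly from the performance summary):

```python
# Cross-check trtexec's derived metrics against the raw log values above.
queries = 37
total_host_walltime_s = 3.27673

# Throughput = number of queries / Total Host Walltime
throughput_qps = queries / total_host_walltime_s
assert abs(throughput_qps - 11.2917) < 1e-3

# Mean Latency = mean H2D + mean GPU Compute + mean D2H (all in ms)
h2d_ms, gpu_ms, d2h_ms = 1.26532, 86.4584, 0.475301
mean_latency_ms = h2d_ms + gpu_ms + d2h_ms
assert abs(mean_latency_ms - 88.199) < 1e-2

# ~88 ms for a batch of 8 is ~11 ms per image.
print(f"{throughput_qps:.4f} qps, {mean_latency_ms:.3f} ms mean latency")
```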

Looking at these results, the inference time seems to increase almost linearly with the batch size, so there is no throughput gain.
I wonder how I can get the benefit of batch inference.

Please reply.
Thanks,

Hi,

Have you maximized the device performance when benchmarking?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

The speed-up from batching depends on the model architecture and is not guaranteed.
Please find the link below for more info about batching:

Thanks.

Hi

I ran those commands and tried again, but the GPU compute time was the same.
You said the speed-up from batching depends on the model and can't be guaranteed,
so does that also apply to the public YOLOv7 model?

And is there anything wrong with that log?
I'm asking because batch inference doesn't seem to work at all.

Thanks.

Hi,

This is model-dependent.
For example, if your model already fully occupies the GPU resources (computation > memory read/write), then you might find that the gain from batching is limited, as the extra computation still has to wait for those resources.

However, if the GPU has to wait for data, then running more data concurrently (batch inference) can minimize the GPU idle time.
To know more about your use case, would you mind running your model with Nsight Systems and gathering some profiling data for us?
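This compute-bound vs. memory-bound distinction can be sketched with a toy cost model (all numbers below are invented for illustration, not Orin NX measurements): total time is the slower of the compute and memory-traffic components, and weight reads happen once per forward pass while compute and activation traffic scale with the batch size:

```python
def batch_time_ms(batch, compute_ms_per_img, weight_read_ms, act_ms_per_img):
    """Toy model: total time = max(compute time, memory time).

    Compute scales linearly with batch size; weight reads happen once
    per forward pass, while activation traffic scales with batch size.
    """
    compute = batch * compute_ms_per_img
    memory = weight_read_ms + batch * act_ms_per_img
    return max(compute, memory)

# Compute-bound model (like the log above): per-image time stays flat.
cb1 = batch_time_ms(1, compute_ms_per_img=10, weight_read_ms=2, act_ms_per_img=1)
cb8 = batch_time_ms(8, compute_ms_per_img=10, weight_read_ms=2, act_ms_per_img=1)
assert cb8 / 8 == cb1 / 1   # no per-image speed-up from batching

# Memory-bound model: weight reads amortize, per-image time drops.
mb1 = batch_time_ms(1, compute_ms_per_img=1, weight_read_ms=8, act_ms_per_img=1)
mb8 = batch_time_ms(8, compute_ms_per_img=1, weight_read_ms=8, act_ms_per_img=1)
assert mb8 / 8 < mb1 / 1    # batching helps here
```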

Thanks.

Hi

I’m sorry for the late reply because of the holiday.

I ran Nsight Systems on the batch inference and uploaded the resulting report file.

Please let me know if you need any more data.

report1.zip (485.3 KB)

Hi

You said that batch inference is model-dependent, so if we attach the model, could you analyze why batch inference doesn't help in our case?

Please reply.

Thanks,

Hi,

Based on the profiling data you shared, the GPU is fully utilized (kernels running 100% of the time).
This should be the reason why you don't see a performance gain when increasing the batch size.

Thanks.

Hi,

So, is there no way to improve batch inference performance?
For example, reducing the kernel usage, modifying the model architecture, etc.
The model is just the public YOLOv7 model.
Does NVIDIA have any batch inference results for YOLOv7?

And what does stream 18 mean?

Thanks.

Hi,

If your GPU is fully occupied at batch size = 1, increasing the batch size won't reduce the per-image latency, since the extra work still has to wait for resources.

We don’t have YOLOv7 performance numbers for Orin NX, but you can find scores for some other models below:

Thanks.

Based on your log, it seems like the model was exported with max batch = 1.

Hi

Do you mean that the model was not converted with batching enabled?
I may have used the wrong conversion method, so could you guide me on how to convert the model so that batching works?

Thanks

What did you use to export?

I used this command to convert it:

trtexec --onnx=model.onnx --saveEngine=model_batch.trt --minShapes=input:1x3x224x224 --optShapes=input:8x3x224x224 --maxShapes=input:32x3x224x224 --fp16
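One follow-up worth noting for an engine built with dynamic shapes like the command above: when benchmarking, the runtime batch size can be selected explicitly with `--shapes` (a sketch reusing the file and tensor names from that command):

```shell
# Run the dynamic-shape engine at an explicit batch size (8 here);
# "input" must match the ONNX input tensor name used at build time.
trtexec --loadEngine=model_batch.trt --shapes=input:8x3x224x224
```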

How did you export the ONNX model?

Do you mean the shape of the ONNX model?

No. What code/command did you use to get the ONNX model out of the PyTorch model?
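(For reference, a dynamic batch dimension is usually requested at ONNX export time via `dynamic_axes` in the standard `torch.onnx.export` API. The sketch below is illustrative only: the model-loading step and tensor names are placeholders, not the poster's actual code.)

```python
import torch

# Placeholder: load your trained YOLOv7 model here,
# e.g. with attempt_load("yolov7.pt") from the yolov7 repo.
model = ...

dummy = torch.zeros(1, 3, 544, 544)  # batch-1 dummy input for tracing

torch.onnx.export(
    model, dummy, "yolov7_544_dynamic.onnx",
    input_names=["input"],
    output_names=["output"],
    # Mark axis 0 (batch) as dynamic so trtexec can later build the
    # engine with --minShapes/--optShapes/--maxShapes over the batch.
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=12,
)
```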