We successfully ran the test with the trtexec command. Here is the complete output:
&&&& RUNNING TensorRT.trtexec [TensorRT v100300] # trtexec --loadEngine=deepstream/model/vale.engine
[07/29/2025-18:57:55] [I] === Model Options ===
[07/29/2025-18:57:55] [I] Format: *
[07/29/2025-18:57:55] [I] Model:
[07/29/2025-18:57:55] [I] Output:
[07/29/2025-18:57:55] [I]
[07/29/2025-18:57:55] [I] === System Options ===
[07/29/2025-18:57:55] [I] Device: 0
[07/29/2025-18:57:55] [I] DLACore:
[07/29/2025-18:57:55] [I] Plugins:
[07/29/2025-18:57:55] [I] setPluginsToSerialize:
[07/29/2025-18:57:55] [I] dynamicPlugins:
[07/29/2025-18:57:55] [I] ignoreParsedPluginLibs: 0
[07/29/2025-18:57:55] [I]
[07/29/2025-18:57:55] [I] === Inference Options ===
[07/29/2025-18:57:55] [I] Batch: Explicit
[07/29/2025-18:57:55] [I] Input inference shapes: model
[07/29/2025-18:57:55] [I] Iterations: 10
[07/29/2025-18:57:55] [I] Duration: 3s (+ 200ms warm up)
[07/29/2025-18:57:55] [I] Sleep time: 0ms
[07/29/2025-18:57:55] [I] Idle time: 0ms
[07/29/2025-18:57:55] [I] Inference Streams: 1
[07/29/2025-18:57:55] [I] ExposeDMA: Disabled
[07/29/2025-18:57:55] [I] Data transfers: Enabled
[07/29/2025-18:57:55] [I] Spin-wait: Disabled
[07/29/2025-18:57:55] [I] Multithreading: Disabled
[07/29/2025-18:57:55] [I] CUDA Graph: Disabled
[07/29/2025-18:57:55] [I] Separate profiling: Disabled
[07/29/2025-18:57:55] [I] Time Deserialize: Disabled
[07/29/2025-18:57:55] [I] Time Refit: Disabled
[07/29/2025-18:57:55] [I] NVTX verbosity: 0
[07/29/2025-18:57:55] [I] Persistent Cache Ratio: 0
[07/29/2025-18:57:55] [I] Optimization Profile Index: 0
[07/29/2025-18:57:55] [I] Weight Streaming Budget: 100.000000%
[07/29/2025-18:57:55] [I] Inputs:
[07/29/2025-18:57:55] [I] Debug Tensor Save Destinations:
[07/29/2025-18:57:55] [I] === Reporting Options ===
[07/29/2025-18:57:55] [I] Verbose: Disabled
[07/29/2025-18:57:55] [I] Averages: 10 inferences
[07/29/2025-18:57:55] [I] Percentiles: 90,95,99
[07/29/2025-18:57:55] [I] Dump refittable layers:Disabled
[07/29/2025-18:57:55] [I] Dump output: Disabled
[07/29/2025-18:57:55] [I] Profile: Disabled
[07/29/2025-18:57:55] [I] Export timing to JSON file:
[07/29/2025-18:57:55] [I] Export output to JSON file:
[07/29/2025-18:57:55] [I] Export profile to JSON file:
[07/29/2025-18:57:55] [I]
[07/29/2025-18:57:55] [I] === Device Information ===
[07/29/2025-18:57:55] [I] Available Devices:
[07/29/2025-18:57:55] [I] Device 0: "Tesla T4" UUID: GPU-e0a0c54c-f462-fe69-20f6-f33c1f9bfe6e
[07/29/2025-18:57:56] [I] Selected Device: Tesla T4
[07/29/2025-18:57:56] [I] Selected Device ID: 0
[07/29/2025-18:57:56] [I] Selected Device UUID: GPU-e0a0c54c-f462-fe69-20f6-f33c1f9bfe6e
[07/29/2025-18:57:56] [I] Compute Capability: 7.5
[07/29/2025-18:57:56] [I] SMs: 40
[07/29/2025-18:57:56] [I] Device Global Memory: 15935 MiB
[07/29/2025-18:57:56] [I] Shared Memory per SM: 64 KiB
[07/29/2025-18:57:56] [I] Memory Bus Width: 256 bits (ECC disabled)
[07/29/2025-18:57:56] [I] Application Compute Clock Rate: 1.59 GHz
[07/29/2025-18:57:56] [I] Application Memory Clock Rate: 5.001 GHz
[07/29/2025-18:57:56] [I]
[07/29/2025-18:57:56] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[07/29/2025-18:57:56] [I]
[07/29/2025-18:57:56] [I] TensorRT version: 10.3.0
[07/29/2025-18:57:56] [I] Loading standard plugins
[07/29/2025-18:57:56] [I] [TRT] Loaded engine size: 112 MiB
[07/29/2025-18:57:56] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[07/29/2025-18:57:56] [I] Engine deserialized in 0.0765873 sec.
[07/29/2025-18:57:56] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +77, now: CPU 0, GPU 189 (MiB)
[07/29/2025-18:57:56] [I] Setting persistentCacheLimit to 0 bytes.
[07/29/2025-18:57:56] [I] Created execution context with device memory size: 74.2188 MiB
[07/29/2025-18:57:56] [I] Using random values for input input
[07/29/2025-18:57:56] [I] Input binding for input with dimensions 1x3x640x640 is created.
[07/29/2025-18:57:56] [I] Output binding for output with dimensions 1x8400x6 is created.
[07/29/2025-18:57:56] [I] Starting inference
[07/29/2025-18:57:59] [I] Warmup completed 8 queries over 200 ms
[07/29/2025-18:57:59] [I] Timing trace has 130 queries over 3.05927 s
[07/29/2025-18:57:59] [I]
[07/29/2025-18:57:59] [I] === Trace details ===
[07/29/2025-18:57:59] [I] Trace averages of 10 runs:
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.0918 ms - Host latency: 23.5008 ms (enqueue 1.69229 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.1935 ms - Host latency: 23.6024 ms (enqueue 1.68199 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.3682 ms - Host latency: 23.7773 ms (enqueue 1.65952 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.9427 ms - Host latency: 24.352 ms (enqueue 1.65334 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.0424 ms - Host latency: 23.4516 ms (enqueue 1.64347 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.2945 ms - Host latency: 23.7028 ms (enqueue 1.65194 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.1616 ms - Host latency: 23.5696 ms (enqueue 1.62484 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.7048 ms - Host latency: 24.1139 ms (enqueue 1.63248 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.2862 ms - Host latency: 23.6951 ms (enqueue 1.65913 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.2471 ms - Host latency: 23.6559 ms (enqueue 1.65354 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.2984 ms - Host latency: 23.7068 ms (enqueue 1.65906 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.5356 ms - Host latency: 23.944 ms (enqueue 1.679 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.3919 ms - Host latency: 23.7993 ms (enqueue 1.63882 ms)
[07/29/2025-18:57:59] [I]
[07/29/2025-18:57:59] [I] === Performance summary ===
[07/29/2025-18:57:59] [I] Throughput: 42.4937 qps
[07/29/2025-18:57:59] [I] Latency: min = 23.2473 ms, max = 25.9184 ms, mean = 23.7593 ms, median = 23.697 ms, percentile(90%) = 24.1882 ms, percentile(95%) = 24.3213 ms, percentile(99%) = 25.8605 ms
[07/29/2025-18:57:59] [I] Enqueue Time: min = 1.60962 ms, max = 1.89923 ms, mean = 1.65611 ms, median = 1.6371 ms, percentile(90%) = 1.72449 ms, percentile(95%) = 1.84314 ms, percentile(99%) = 1.89624 ms
[07/29/2025-18:57:59] [I] H2D Latency: min = 0.383667 ms, max = 0.389648 ms, mean = 0.385992 ms, median = 0.385986 ms, percentile(90%) = 0.388062 ms, percentile(95%) = 0.388428 ms, percentile(99%) = 0.389404 ms
[07/29/2025-18:57:59] [I] GPU Compute Time: min = 22.8389 ms, max = 25.5097 ms, mean = 23.3507 ms, median = 23.2883 ms, percentile(90%) = 23.7808 ms, percentile(95%) = 23.9138 ms, percentile(99%) = 25.4509 ms
[07/29/2025-18:57:59] [I] D2H Latency: min = 0.0195312 ms, max = 0.0248413 ms, mean = 0.0226805 ms, median = 0.0227051 ms, percentile(90%) = 0.0239258 ms, percentile(95%) = 0.0241699 ms, percentile(99%) = 0.0245361 ms
[07/29/2025-18:57:59] [I] Total Host Walltime: 3.05927 s
[07/29/2025-18:57:59] [I] Total GPU Compute Time: 3.03559 s
[07/29/2025-18:57:59] [W] * GPU compute time is unstable, with coefficient of variance = 1.87985%.
[07/29/2025-18:57:59] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/29/2025-18:57:59] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/29/2025-18:57:59] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100300] # trtexec --loadEngine=deepstream/model/vale.engine
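The run passes, but the warning near the end notes that GPU compute time is unstable (coefficient of variance ~1.88%) and suggests locking the GPU clock or adding --useSpinWait. A sketch of how that could be done on this machine, assuming device 0 and using the 1590 MHz value taken from the reported 1.59 GHz application compute clock (adjust to your GPU's supported clocks, which `nvidia-smi -q -d SUPPORTED_CLOCKS` lists):

```shell
# Lock the graphics clock at the application clock rate (1590 MHz on this T4)
# so the benchmark is not skewed by dynamic clock changes.
sudo nvidia-smi -i 0 --lock-gpu-clocks=1590,1590

# Re-run the same benchmark with spin-wait polling for steadier host-side timing.
trtexec --loadEngine=deepstream/model/vale.engine --useSpinWait

# Release the clock lock when done.
sudo nvidia-smi -i 0 --reset-gpu-clocks
```

These commands require root and an NVIDIA driver recent enough to support `--lock-gpu-clocks`; on shared machines, remember to reset the clocks afterwards so other workloads are unaffected.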
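As a sanity check on the summary numbers, the reported throughput should equal the number of timed queries divided by the total host walltime, and the total GPU compute time should be roughly the mean GPU compute time multiplied by the query count. A minimal check using the values from the log above:

```python
# Values copied from the trtexec performance summary above.
queries = 130              # "Timing trace has 130 queries"
walltime_s = 3.05927       # "Total Host Walltime"
reported_qps = 42.4937     # "Throughput"
mean_gpu_ms = 23.3507      # "GPU Compute Time: ... mean"
total_gpu_s = 3.03559      # "Total GPU Compute Time"

# Throughput is queries per second of host walltime.
qps = queries / walltime_s
assert abs(qps - reported_qps) < 0.01, qps

# Total GPU compute time is approximately mean latency times query count.
assert abs(queries * mean_gpu_ms / 1000.0 - total_gpu_s) < 0.01

print(f"computed throughput: {qps:.4f} qps")
```

The ~42.5 qps figure also lines up with the ~23.5 ms per-query latency, confirming the pipeline is effectively serial (one inference stream, as shown in the Inference Options).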