Engine failed to match config params, trying rebuild

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) - Jetson Xavier NX
• DeepStream Version - 6.3
• JetPack Version (valid for Jetson only) - 5.1.2
• TensorRT Version - 8.5.2
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type (questions, new requirements, bugs) - Question
• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file contents, the command line used, and other details for reproducing.)
• Requirement details (This is for new requirements. Include the module name - for which plugin or for which sample application - and the function description.) - deepstream-test5-app

Hi,

1- I followed this repo: GitHub - WongKinYiu/yolov7 (Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors) and trained a custom yolov7-tiny model.

  • Train command: python train.py --workers 8 --device 0 --batch-size 16 --data data/custom.yaml --img 640 480 --cfg cfg/training/yolov7-tiny.yaml --weights 'yolov7-tiny.pt' --name yolov7-tiny-custom --hyp data/hyp.scratch.tiny.yaml

2- I reparameterized the custom model as recommended by the repository.

3- I followed this repo: yolo_deepstream/yolov7_qat at main · NVIDIA-AI-IOT/yolo_deepstream · GitHub

I converted the QAT model to an INT8 engine with the following commands:

  • sudo /usr/src/tensorrt/bin/trtexec --onnx=qat.onnx --int8 --saveEngine=yolov7_tiny_qat_3.engine --workspace=1024000
  • sudo /usr/src/tensorrt/bin/trtexec --onnx=qat.onnx --int8 --fp16 --saveEngine=yolov7_tiny_qat_2.engine --workspace=1024000 --minShapes=images:4x3x640x640 --optShapes=images:4x3x640x640 --maxShapes=images:4x3x640x640
  • sudo /usr/src/tensorrt/bin/trtexec --onnx=qat.onnx --int8 --saveEngine=yolov7_tiny_qat.engine --workspace=1024000 --minShapes=images:4x3x640x640 --optShapes=images:4x3x640x640 --maxShapes=images:4x3x640x640
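For comparison, if the DeepStream config requests batch-size 16, the engine's optimization profile has to cover batch 16 as well. A sketch of such a build command (assuming the ONNX input is named `images` and has a dynamic batch dimension; the output file name `yolov7_tiny_qat_b16.engine` is illustrative):

```shell
# Hypothetical: build an INT8 engine whose profile covers batch 16,
# so it matches a DeepStream config with batch-size=16.
sudo /usr/src/tensorrt/bin/trtexec \
  --onnx=qat.onnx --int8 \
  --minShapes=images:1x3x640x640 \
  --optShapes=images:16x3x640x640 \
  --maxShapes=images:16x3x640x640 \
  --saveEngine=yolov7_tiny_qat_b16.engine
```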

The engine file was created successfully every time.
When I checked the engine with sudo /usr/src/tensorrt/bin/trtexec --loadEngine=/opt/nvidia/deepstream/deepstream-6.3/sources/objectDetector_Yolo/yolov7/yolov7_tiny_qat.engine --plugins=./libmyplugins.so, I got this:

[07/05/2024-14:58:30] [I] === Performance summary ===
[07/05/2024-14:58:30] [I] Throughput: 34.7856 qps
[07/05/2024-14:58:30] [I] Latency: min = 24.8691 ms, max = 42.3534 ms, mean = 28.7271 ms, median = 25.0518 ms, percentile(90%) = 39.1527 ms, percentile(95%) = 39.1918 ms, percentile(99%) = 39.2491 ms
[07/05/2024-14:58:30] [I] Enqueue Time: min = 3.00171 ms, max = 5.73328 ms, mean = 3.9051 ms, median = 3.82019 ms, percentile(90%) = 4.84863 ms, percentile(95%) = 5.13629 ms, percentile(99%) = 5.36453 ms
[07/05/2024-14:58:30] [I] H2D Latency: min = 0.703857 ms, max = 1.16614 ms, mean = 0.820026 ms, median = 0.720337 ms, percentile(90%) = 1.16443 ms, percentile(95%) = 1.16507 ms, percentile(99%) = 1.16565 ms
[07/05/2024-14:58:30] [I] GPU Compute Time: min = 24.0393 ms, max = 40.9948 ms, mean = 27.7691 ms, median = 24.2098 ms, percentile(90%) = 37.7932 ms, percentile(95%) = 37.8299 ms, percentile(99%) = 37.8911 ms
[07/05/2024-14:58:30] [I] D2H Latency: min = 0.11731 ms, max = 0.19989 ms, mean = 0.137928 ms, median = 0.123291 ms, percentile(90%) = 0.195007 ms, percentile(95%) = 0.196533 ms, percentile(99%) = 0.198792 ms
[07/05/2024-14:58:30] [I] Total Host Walltime: 3.07598 s
[07/05/2024-14:58:30] [I] Total GPU Compute Time: 2.97129 s
[07/05/2024-14:58:30] [W] * GPU compute time is unstable, with coefficient of variance = 19.9549%.
[07/05/2024-14:58:30] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/05/2024-14:58:30] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/05/2024-14:58:30] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=/opt/nvidia/deepstream/deepstream-6.3/sources/objectDetector_Yolo/yolov7/yolov7_tiny_qat.engine --plugins=./libmyplugins.so
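Regarding the clock-stability warning in the log above: on Jetson this is usually addressed by locking the clocks before benchmarking. A minimal sketch using the standard JetPack tools (power mode 0 is assumed to be the max-performance mode on this board):

```shell
# Pin clocks to their maxima before running trtexec, so GPU compute
# time does not fluctuate with DVFS.
sudo nvpmodel -m 0      # select the max-performance power mode
sudo jetson_clocks      # lock CPU/GPU/EMC clocks at that mode's maxima
```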

4- When I test the created engine file with the deepstream-test5-app application, I get the following error:

Unknown or legacy key specified 'is-classifier' for group [property]
Unknown or legacy key specified 'disable-output-host-copy' for group [property]
gstnvtracker: Loading low-level lib at /opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so
[NvMultiObjectTracker] Initialized
0:00:08.013987301 8575 0xaaaafff84580 INFO nvinfer gstnvinfer.cpp:682:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:1988> [UID = 1]: deserialized trt engine from :/opt/nvidia/deepstream/deepstream-6.3/sources/objectDetector_Yolo/yolov7/yolov7_tiny_qat_3.engine
WARNING: [TRT]: The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
INFO: [Implicit Engine Info]: layers num: 2
0 INPUT kFLOAT images 3x640x640
1 OUTPUT kFLOAT outputs 25200x7

0:00:08.093343861 8575 0xaaaafff84580 WARN nvinfer gstnvinfer.cpp:679:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Warning from NvDsInferContextImpl::checkBackendParams() <nvdsinfer_context_impl.cpp:1920> [UID = 1]: Backend has maxBatchSize 1 whereas 16 has been requested
0:00:08.093459190 8575 0xaaaafff84580 WARN nvinfer gstnvinfer.cpp:679:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Warning from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2097> [UID = 1]: deserialized backend context :/opt/nvidia/deepstream/deepstream-6.3/sources/objectDetector_Yolo/yolov7/yolov7_tiny_qat_3.engine failed to match config params, trying rebuild
0:00:08.122447330 8575 0xaaaafff84580 INFO nvinfer gstnvinfer.cpp:682:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:2002> [UID = 1]: Trying to create engine from model files
ERROR: failed to build network since there is no model file matched.
ERROR: failed to build network.
0:00:09.446987646 8575 0xaaaafff84580 ERROR nvinfer gstnvinfer.cpp:676:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Error in NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:2022> [UID = 1]: build engine file failed
0:00:09.520248994 8575 0xaaaafff84580 ERROR nvinfer gstnvinfer.cpp:676:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Error in NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2108> [UID = 1]: build backend context failed
0:00:09.520404612 8575 0xaaaafff84580 ERROR nvinfer gstnvinfer.cpp:676:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Error in NvDsInferContextImpl::initialize() <nvdsinfer_context_impl.cpp:1282> [UID = 1]: generate backend failed, check config file settings
0:00:09.520527044 8575 0xaaaafff84580 WARN nvinfer gstnvinfer.cpp:898:gst_nvinfer_start:<primary_gie> error: Failed to create NvDsInferContext instance
0:00:09.520577637 8575 0xaaaafff84580 WARN nvinfer gstnvinfer.cpp:898:gst_nvinfer_start:<primary_gie> error: Config file path: /opt/nvidia/deepstream/deepstream-6.3/sources/objectDetector_Yolo/yolov7/config_infer_primary_yoloV7_tiny.txt, NvDsInfer Error: NVDSINFER_CONFIG_FAILED
[NvMultiObjectTracker] De-initialized
** ERROR: main:1534: Failed to set pipeline to PAUSED
Quitting
ERROR from primary_gie: Failed to create NvDsInferContext instance
Debug info: /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvinfer/gstnvinfer.cpp(898): gst_nvinfer_start (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInfer:primary_gie:
Config file path: /opt/nvidia/deepstream/deepstream-6.3/sources/objectDetector_Yolo/yolov7/config_infer_primary_yoloV7_tiny.txt, NvDsInfer Error: NVDSINFER_CONFIG_FAILED
App run failed

(.pt → .onnx → FP16 (built by DeepStream) works without any problems.)

config_infer_primary_yoloV7_tiny.txt (4.1 KB)

I can't understand why the INT8 model doesn't work.

Thanks for the help.

Could you set the batch-size=16 in your config file?

Thanks for the help.

Setting the batch-size parameter to 16 in the config_infer_primary_yoloV7_tiny.txt file did not work. However, it reminded me that the primary-gie/batch-size parameter in the deepstream_app_config_yoloV7_tiny.txt file is set to 16. When I removed that parameter, it turned out to be the one that did not match the engine. The INT8 model is now working.
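For reference, the batch-size keys in the two files must agree with the engine's maximum batch. A minimal sketch of the relevant fragments (key names from the DeepStream samples; values illustrative for a batch-1 engine):

```ini
; deepstream_app_config_yoloV7_tiny.txt
[primary-gie]
batch-size=1          ; overrides the nvinfer config; must match the engine

; config_infer_primary_yoloV7_tiny.txt
[property]
batch-size=1
model-engine-file=yolov7_tiny_qat_3.engine
```

If both keys are present, the [primary-gie] value wins, which is why removing it (or setting it to the engine's batch) resolves the "failed to match config params, trying rebuild" path.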

At this point, my other question is:
.pt → .onnx → deepstream → fp16 = 6.48 FPS with 25 video sources
.pt → qat.onnx → trtexec → int8 = 3.94 FPS with 25 video sources

While all other parameters are exactly the same, is the FPS difference due to the conversion method? Is it normal to see an FPS difference between an engine built by trtexec and one built by DeepStream?
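One way to compare the two numbers fairly is to look at aggregate inferences per second rather than per-stream FPS. A small sketch using the figures above (the trtexec qps is the batch-1 throughput reported later in this thread):

```python
# Convert per-stream FPS (averaged over 25 sources) into aggregate
# inferences/second, and compare with trtexec's batch-1 throughput.
STREAMS = 25

fp16_per_stream = 6.48   # DeepStream-built FP16 engine, per stream
int8_per_stream = 3.94   # trtexec-built INT8 engine, per stream

fp16_aggregate = STREAMS * fp16_per_stream   # ~162 inferences/s
int8_aggregate = STREAMS * int8_per_stream   # ~98.5 inferences/s

# trtexec reported ~114.96 qps for the INT8 engine at batch 1.
# A batch-1 engine forces nvinfer to process the 25 streams one
# frame at a time, so the pipeline cannot exceed that batch-1 rate,
# while the FP16 engine built by DeepStream can batch frames.
trtexec_batch1_qps = 114.961

print(fp16_aggregate, int8_aggregate, trtexec_batch1_qps)
```

This suggests the gap comes from the effective batch size of each engine rather than from INT8 vs FP16 precision as such.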

Can you describe in detail how you got these 2 values of FPS?

I measured using the perf-measurement-interval-sec parameter.

deepstream_app_config_yoloV7_tiny.txt (7.4 KB)

fp16 →
config_infer_primary_yoloV7_tiny.txt (4.2 KB)

int8 →
config_infer_primary_yoloV7_tiny.txt (4.1 KB)

I tested each model with deepstream-test5-app.

What’s your whole command to test the trtexec perf?

I created the INT8 engine with the following command:

sudo /usr/src/tensorrt/bin/trtexec --onnx=qat.onnx --int8 --saveEngine=yolov7_tiny_qat_3.engine --workspace=1024000

I then tested the engine by running:

sudo /opt/nvidia/deepstream/deepstream-6.3/sources/apps/sample_apps/deepstream-test5/deepstream-test5-app -c /opt/nvidia/deepstream/deepstream-6.3/sources/objectDetector_Yolo/yolov7-tiny/deepstream_app_config_yoloV7_tiny.txt

I followed outputs like the following from the terminal (example):

PERF: 19.45 (19.00) 19.45 (19.04) 19.45 (19.08) 19.40 (19.04) 19.47 (19.10) 19.47 (18.98) 19.47 (19.06) 19.47 (19.04) 19.47 (18.99) 19.41 (19.10) 19.41 (19.06) 19.43 (19.07) 19.43 (18.97) 19.43 (19.08) 19.43 (19.07) 19.40 (19.08) 19.38 (19.10) 19.44 (19.06) 19.44 (18.99) 19.44 (19.01) 19.44 (19.05) 19.44 (19.04) 19.39 (19.10) 19.45 (19.07) 19.45 (19.10)

and got the results I mentioned above.

I know how DeepStream got the FPS. I mean, what is your whole command to test the trtexec perf (the 2nd pipeline you attached)?

I assume this is what you mean:

sudo /usr/src/tensorrt/bin/trtexec --loadEngine=/opt/nvidia/deepstream/deepstream-6.3/sources/objectDetector_Yolo/yolov7-tiny/yolov7_tiny_qat_3.engine

Output:

&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=/opt/nvidia/deepstream/deepstream-6.3/sources/objectDetector_Yolo/yolov7-tiny/yolov7_tiny_qat_3.engine
[07/08/2024-13:41:01] [I] === Model Options ===
[07/08/2024-13:41:01] [I] Format: *
[07/08/2024-13:41:01] [I] Model:
[07/08/2024-13:41:01] [I] Output:
[07/08/2024-13:41:01] [I] === Build Options ===
[07/08/2024-13:41:01] [I] Max batch: 1
[07/08/2024-13:41:01] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[07/08/2024-13:41:01] [I] minTiming: 1
[07/08/2024-13:41:01] [I] avgTiming: 8
[07/08/2024-13:41:01] [I] Precision: FP32
[07/08/2024-13:41:01] [I] LayerPrecisions:
[07/08/2024-13:41:01] [I] Calibration:
[07/08/2024-13:41:01] [I] Refit: Disabled
[07/08/2024-13:41:01] [I] Sparsity: Disabled
[07/08/2024-13:41:01] [I] Safe mode: Disabled
[07/08/2024-13:41:01] [I] DirectIO mode: Disabled
[07/08/2024-13:41:01] [I] Restricted mode: Disabled
[07/08/2024-13:41:01] [I] Build only: Disabled
[07/08/2024-13:41:01] [I] Save engine:
[07/08/2024-13:41:01] [I] Load engine: /opt/nvidia/deepstream/deepstream-6.3/sources/objectDetector_Yolo/yolov7-tiny/yolov7_tiny_qat_3.engine
[07/08/2024-13:41:01] [I] Profiling verbosity: 0
[07/08/2024-13:41:01] [I] Tactic sources: Using default tactic sources
[07/08/2024-13:41:01] [I] timingCacheMode: local
[07/08/2024-13:41:01] [I] timingCacheFile:
[07/08/2024-13:41:01] [I] Heuristic: Disabled
[07/08/2024-13:41:01] [I] Preview Features: Use default preview flags.
[07/08/2024-13:41:01] [I] Input(s)s format: fp32:CHW
[07/08/2024-13:41:01] [I] Output(s)s format: fp32:CHW
[07/08/2024-13:41:01] [I] Input build shapes: model
[07/08/2024-13:41:01] [I] Input calibration shapes: model
[07/08/2024-13:41:01] [I] === System Options ===
[07/08/2024-13:41:01] [I] Device: 0
[07/08/2024-13:41:01] [I] DLACore:
[07/08/2024-13:41:01] [I] Plugins:
[07/08/2024-13:41:01] [I] === Inference Options ===
[07/08/2024-13:41:01] [I] Batch: 1
[07/08/2024-13:41:01] [I] Input inference shapes: model
[07/08/2024-13:41:01] [I] Iterations: 10
[07/08/2024-13:41:01] [I] Duration: 3s (+ 200ms warm up)
[07/08/2024-13:41:01] [I] Sleep time: 0ms
[07/08/2024-13:41:01] [I] Idle time: 0ms
[07/08/2024-13:41:01] [I] Streams: 1
[07/08/2024-13:41:01] [I] ExposeDMA: Disabled
[07/08/2024-13:41:01] [I] Data transfers: Enabled
[07/08/2024-13:41:01] [I] Spin-wait: Disabled
[07/08/2024-13:41:01] [I] Multithreading: Disabled
[07/08/2024-13:41:01] [I] CUDA Graph: Disabled
[07/08/2024-13:41:01] [I] Separate profiling: Disabled
[07/08/2024-13:41:01] [I] Time Deserialize: Disabled
[07/08/2024-13:41:01] [I] Time Refit: Disabled
[07/08/2024-13:41:01] [I] NVTX verbosity: 0
[07/08/2024-13:41:01] [I] Persistent Cache Ratio: 0
[07/08/2024-13:41:01] [I] Inputs:
[07/08/2024-13:41:01] [I] === Reporting Options ===
[07/08/2024-13:41:01] [I] Verbose: Disabled
[07/08/2024-13:41:01] [I] Averages: 10 inferences
[07/08/2024-13:41:01] [I] Percentiles: 90,95,99
[07/08/2024-13:41:01] [I] Dump refittable layers:Disabled
[07/08/2024-13:41:01] [I] Dump output: Disabled
[07/08/2024-13:41:01] [I] Profile: Disabled
[07/08/2024-13:41:01] [I] Export timing to JSON file:
[07/08/2024-13:41:01] [I] Export output to JSON file:
[07/08/2024-13:41:01] [I] Export profile to JSON file:
[07/08/2024-13:41:01] [I]
[07/08/2024-13:41:01] [I] === Device Information ===
[07/08/2024-13:41:01] [I] Selected Device: Xavier
[07/08/2024-13:41:01] [I] Compute Capability: 7.2
[07/08/2024-13:41:01] [I] SMs: 6
[07/08/2024-13:41:01] [I] Compute Clock Rate: 1.109 GHz
[07/08/2024-13:41:01] [I] Device Global Memory: 6845 MiB
[07/08/2024-13:41:01] [I] Shared Memory per SM: 96 KiB
[07/08/2024-13:41:01] [I] Memory Bus Width: 256 bits (ECC disabled)
[07/08/2024-13:41:01] [I] Memory Clock Rate: 1.109 GHz
[07/08/2024-13:41:01] [I]
[07/08/2024-13:41:01] [I] TensorRT version: 8.5.2
[07/08/2024-13:41:01] [I] Engine loaded in 0.209995 sec.
[07/08/2024-13:41:03] [I] [TRT] Loaded engine size: 7 MiB
[07/08/2024-13:41:03] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[07/08/2024-13:41:06] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +343, GPU +269, now: CPU 595, GPU 3536 (MiB)
[07/08/2024-13:41:06] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +8, now: CPU 0, GPU 8 (MiB)
[07/08/2024-13:41:06] [I] Engine deserialized in 4.57182 sec.
[07/08/2024-13:41:06] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 595, GPU 3536 (MiB)
[07/08/2024-13:41:06] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +137, now: CPU 0, GPU 145 (MiB)
[07/08/2024-13:41:06] [I] Setting persistentCacheLimit to 0 bytes.
[07/08/2024-13:41:06] [I] Using random values for input images
[07/08/2024-13:41:06] [I] Created input binding for images with dimensions 1x3x640x640
[07/08/2024-13:41:06] [I] Using random values for output outputs
[07/08/2024-13:41:06] [I] Created output binding for outputs with dimensions 1x25200x7
[07/08/2024-13:41:06] [I] Starting inference
[07/08/2024-13:41:09] [I] Warmup completed 1 queries over 200 ms
[07/08/2024-13:41:09] [I] Timing trace has 332 queries over 2.88795 s
[07/08/2024-13:41:09] [I]
[07/08/2024-13:41:09] [I] === Trace details ===
[07/08/2024-13:41:09] [I] Trace averages of 10 runs:
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 18.9359 ms - Host latency: 19.4975 ms (enqueue 3.61808 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 11.1465 ms - Host latency: 11.4503 ms (enqueue 3.27811 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 10.708 ms - Host latency: 11.0055 ms (enqueue 3.2691 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.82712 ms - Host latency: 8.04069 ms (enqueue 3.02569 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 8.08832 ms - Host latency: 8.30468 ms (enqueue 3.19858 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.82518 ms - Host latency: 8.03963 ms (enqueue 2.97252 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.82941 ms - Host latency: 8.04319 ms (enqueue 2.8594 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.82692 ms - Host latency: 8.04103 ms (enqueue 3.00575 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.83351 ms - Host latency: 8.04733 ms (enqueue 3.2326 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.82939 ms - Host latency: 8.04318 ms (enqueue 2.96793 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.82729 ms - Host latency: 8.04154 ms (enqueue 3.36674 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.81897 ms - Host latency: 8.03446 ms (enqueue 3.22677 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 8.05098 ms - Host latency: 8.26477 ms (enqueue 3.12705 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.8183 ms - Host latency: 8.03151 ms (enqueue 3.09554 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.81986 ms - Host latency: 8.03369 ms (enqueue 3.08525 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.82544 ms - Host latency: 8.03925 ms (enqueue 3.03726 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.81415 ms - Host latency: 8.02954 ms (enqueue 3.51119 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.83386 ms - Host latency: 8.0481 ms (enqueue 3.04595 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.82877 ms - Host latency: 8.04205 ms (enqueue 3.0391 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 8.08694 ms - Host latency: 8.30098 ms (enqueue 3.11271 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.83005 ms - Host latency: 8.04502 ms (enqueue 2.90305 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.82705 ms - Host latency: 8.0407 ms (enqueue 3.16528 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.84851 ms - Host latency: 8.06831 ms (enqueue 3.1854 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 8.36785 ms - Host latency: 8.59512 ms (enqueue 3.91667 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 9.11221 ms - Host latency: 9.34561 ms (enqueue 4.39216 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.9311 ms - Host latency: 8.14849 ms (enqueue 3.20911 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 8.04946 ms - Host latency: 8.27036 ms (enqueue 3.18606 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 8.04055 ms - Host latency: 8.26262 ms (enqueue 3.25076 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 8.04089 ms - Host latency: 8.26099 ms (enqueue 3.24097 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.82808 ms - Host latency: 8.04282 ms (enqueue 3.14546 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.82832 ms - Host latency: 8.04263 ms (enqueue 2.95879 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.82993 ms - Host latency: 8.04583 ms (enqueue 3.20991 ms)
[07/08/2024-13:41:09] [I] Average on 10 runs - GPU latency: 7.83213 ms - Host latency: 8.04702 ms (enqueue 2.998 ms)
[07/08/2024-13:41:09] [I]
[07/08/2024-13:41:09] [I] === Performance summary ===
[07/08/2024-13:41:09] [I] Throughput: 114.961 qps
[07/08/2024-13:41:09] [I] Latency: min = 7.96069 ms, max = 23.632 ms, mean = 8.68096 ms, median = 8.04974 ms, percentile(90%) = 9.97339 ms, percentile(95%) = 11.4469 ms, percentile(99%) = 23.5274 ms
[07/08/2024-13:41:09] [I] Enqueue Time: min = 2.78937 ms, max = 8.35522 ms, mean = 3.2051 ms, median = 3.0647 ms, percentile(90%) = 3.54047 ms, percentile(95%) = 3.7323 ms, percentile(99%) = 5.2605 ms
[07/08/2024-13:41:09] [I] H2D Latency: min = 0.177856 ms, max = 0.580231 ms, mean = 0.195686 ms, median = 0.180664 ms, percentile(90%) = 0.227844 ms, percentile(95%) = 0.256378 ms, percentile(99%) = 0.578369 ms
[07/08/2024-13:41:09] [I] GPU Compute Time: min = 7.74707 ms, max = 22.9504 ms, mean = 8.44914 ms, median = 7.83508 ms, percentile(90%) = 9.75623 ms, percentile(95%) = 11.1423 ms, percentile(99%) = 22.8456 ms
[07/08/2024-13:41:09] [I] D2H Latency: min = 0.0314941 ms, max = 0.103943 ms, mean = 0.0361373 ms, median = 0.0339355 ms, percentile(90%) = 0.0368652 ms, percentile(95%) = 0.0470581 ms, percentile(99%) = 0.100922 ms
[07/08/2024-13:41:09] [I] Total Host Walltime: 2.88795 s
[07/08/2024-13:41:09] [I] Total GPU Compute Time: 2.80511 s
[07/08/2024-13:41:09] [W] * GPU compute time is unstable, with coefficient of variance = 26.4652%.
[07/08/2024-13:41:09] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/08/2024-13:41:09] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/08/2024-13:41:09] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=/opt/nvidia/deepstream/deepstream-6.3/sources/objectDetector_Yolo/yolov7-tiny/yolov7_tiny_qat_3.engine

Judging from the trtexec command, you did not configure the batch size and so on.
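To make the trtexec measurement comparable to the DeepStream run, the benchmark would need to use the same batch size. A sketch, assuming an engine built with a profile covering batch 16 (the file name `yolov7_tiny_qat_b16.engine` is hypothetical):

```shell
# Hypothetical: benchmark at the batch size DeepStream actually uses,
# instead of the default batch 1.
sudo /usr/src/tensorrt/bin/trtexec \
  --loadEngine=yolov7_tiny_qat_b16.engine \
  --shapes=images:16x3x640x640 \
  --useSpinWait
```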