DeepStream App Performance Drop with Custom YOLOv11 Model

We developed a simple Python application using DeepStream, based on the deepstream_test_1 example from the deepstream_python_apps repository. It worked flawlessly with the TrafficCamNet model (TrafficCamNet | NVIDIA NGC), delivering excellent performance at approximately 600 FPS.

We then modified the application to process video using our custom YOLOv11 model, following the guidance provided in this documentation: Ultralytics YOLO11 on NVIDIA Jetson using DeepStream SDK and TensorRT.

While the integration was successful, we observed a significant drop in performance—from 600 FPS down to around 40 FPS.

We would like to optimize our application to achieve better performance with our custom model. Below are our configuration files, pipeline setup, and hardware specs. For testing, we are using an .mp4 video file.

deepstream-app version 7.1.0
DeepStreamSDK 7.1.0
CUDA Driver Version: 12.6
CUDA Runtime Version: 12.6
TensorRT Version: 10.3
cuDNN Version: 9.0
libNVWarp360 Version: 2.0.1d3
GPU: Tesla T4 - 16 GB

pipeline = Gst.parse_launch(
    f"filesrc location={file_path} ! qtdemux name=demux "
    "demux.video_0 ! h264parse ! nvv4l2decoder ! nvvideoconvert ! "
    f"video/x-raw(memory:NVMM),format=NV12,width={width},height={height} ! "
    "queue ! mux.sink_0 "
    f"nvstreammux name=mux batch-size=1 width={width} height={height} live-source=0 batched-push-timeout=40000 ! "
    "nvinfer config-file-path=/opt/nvidia/deepstream/deepstream/sources/vale/config.txt ! "
    "nvdsosd name=osd ! nvvideoconvert ! "
    "video/x-raw(memory:NVMM),format=NV12 ! "
    "nvv4l2h264enc ! h264parse ! qtmux ! "
    "filesink location=output.mp4"
)
osd = pipeline.get_by_name("osd")
if osd:
    sinkpad = osd.get_static_pad("sink")
    if sinkpad:
        sinkpad.add_probe(Gst.PadProbeType.BUFFER, process_video, 0)
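
A minimal sketch of one way to estimate FPS with a pad probe at the OSD sink (illustrative only; fps_probe and its counters are hypothetical names, not our actual process_video):

import time
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

# Hypothetical FPS probe: counts buffers reaching the OSD sink pad and
# prints a running average roughly once per second.
_stats = {"frames": 0, "t0": time.monotonic()}

def fps_probe(pad, info, user_data):
    _stats["frames"] += 1
    now = time.monotonic()
    elapsed = now - _stats["t0"]
    if elapsed >= 1.0:
        print(f"FPS: {_stats['frames'] / elapsed:.1f}")
        _stats["frames"] = 0
        _stats["t0"] = now
    return Gst.PadProbeReturn.OK

# Attached the same way as process_video above:
# sinkpad.add_probe(Gst.PadProbeType.BUFFER, fps_probe, 0)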

config_custom_model.txt (639 Bytes)
config_traffic_cam.txt (592 Bytes)

Could you run the command below to check the perf of your model first?

trtexec --loadEngine=vale.engine --batch=1 --fp16

Hi there!

I’m working with rodrsouza on this. I just tried to run the command, and at first I got the following error due to the --batch=1 option:

[07/25/2025-13:40:28] [E] Unknown option: --batch 1
&&&& FAILED TensorRT.trtexec [TensorRT v100300] # trtexec --loadEngine=vale.engine --batch=1 --fp16

But when I tried to use trtexec --loadEngine=vale.engine --fp16, I got the following output:

&&&& RUNNING TensorRT.trtexec [TensorRT v100300] # trtexec --loadEngine=vale.engine --fp16
[07/25/2025-13:22:10] [I] === Model Options ===
[07/25/2025-13:22:10] [I] Format: *
[07/25/2025-13:22:10] [I] Model: 
[07/25/2025-13:22:10] [I] Output:
[07/25/2025-13:22:10] [I] 
[07/25/2025-13:22:10] [I] === System Options ===
[07/25/2025-13:22:10] [I] Device: 0
[07/25/2025-13:22:10] [I] DLACore: 
[07/25/2025-13:22:10] [I] Plugins:
[07/25/2025-13:22:10] [I] setPluginsToSerialize:
[07/25/2025-13:22:10] [I] dynamicPlugins:
[07/25/2025-13:22:10] [I] ignoreParsedPluginLibs: 0
[07/25/2025-13:22:10] [I] 
[07/25/2025-13:22:10] [I] === Inference Options ===
[07/25/2025-13:22:10] [I] Batch: Explicit
[07/25/2025-13:22:10] [I] Input inference shapes: model
[07/25/2025-13:22:10] [I] Iterations: 10
[07/25/2025-13:22:10] [I] Duration: 3s (+ 200ms warm up)
[07/25/2025-13:22:10] [I] Sleep time: 0ms
[07/25/2025-13:22:10] [I] Idle time: 0ms
[07/25/2025-13:22:10] [I] Inference Streams: 1
[07/25/2025-13:22:10] [I] ExposeDMA: Disabled
[07/25/2025-13:22:10] [I] Data transfers: Enabled
[07/25/2025-13:22:10] [I] Spin-wait: Disabled
[07/25/2025-13:22:10] [I] Multithreading: Disabled
[07/25/2025-13:22:10] [I] CUDA Graph: Disabled
[07/25/2025-13:22:10] [I] Separate profiling: Disabled
[07/25/2025-13:22:10] [I] Time Deserialize: Disabled
[07/25/2025-13:22:10] [I] Time Refit: Disabled
[07/25/2025-13:22:10] [I] NVTX verbosity: 0
[07/25/2025-13:22:10] [I] Persistent Cache Ratio: 0
[07/25/2025-13:22:10] [I] Optimization Profile Index: 0
[07/25/2025-13:22:10] [I] Weight Streaming Budget: 100.000000%
[07/25/2025-13:22:10] [I] Inputs:
[07/25/2025-13:22:10] [I] Debug Tensor Save Destinations:
[07/25/2025-13:22:10] [I] === Reporting Options ===
[07/25/2025-13:22:10] [I] Verbose: Disabled
[07/25/2025-13:22:10] [I] Averages: 10 inferences
[07/25/2025-13:22:10] [I] Percentiles: 90,95,99
[07/25/2025-13:22:10] [I] Dump refittable layers:Disabled
[07/25/2025-13:22:10] [I] Dump output: Disabled
[07/25/2025-13:22:10] [I] Profile: Disabled
[07/25/2025-13:22:10] [I] Export timing to JSON file: 
[07/25/2025-13:22:10] [I] Export output to JSON file: 
[07/25/2025-13:22:10] [I] Export profile to JSON file: 
[07/25/2025-13:22:10] [I] 
[07/25/2025-13:22:10] [I] === Device Information ===
[07/25/2025-13:22:11] [I] Available Devices: 
[07/25/2025-13:22:11] [I]   Device 0: "Tesla T4" UUID: GPU-bbd80375-ec3c-7309-6621-4f2970f1d139
[07/25/2025-13:22:11] [I] Selected Device: Tesla T4
[07/25/2025-13:22:11] [I] Selected Device ID: 0
[07/25/2025-13:22:11] [I] Selected Device UUID: GPU-bbd80375-ec3c-7309-6621-4f2970f1d139
[07/25/2025-13:22:11] [I] Compute Capability: 7.5
[07/25/2025-13:22:11] [I] SMs: 40
[07/25/2025-13:22:11] [I] Device Global Memory: 14930 MiB
[07/25/2025-13:22:11] [I] Shared Memory per SM: 64 KiB
[07/25/2025-13:22:11] [I] Memory Bus Width: 256 bits (ECC enabled)
[07/25/2025-13:22:11] [I] Application Compute Clock Rate: 1.59 GHz
[07/25/2025-13:22:11] [I] Application Memory Clock Rate: 5.001 GHz
[07/25/2025-13:22:11] [I] 
[07/25/2025-13:22:11] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[07/25/2025-13:22:11] [I] 
[07/25/2025-13:22:11] [I] TensorRT version: 10.3.0
[07/25/2025-13:22:11] [I] Loading standard plugins
LLVM ERROR: out of memory

Could you attach the complete command you used to generate the ONNX file? I ran the command below and it worked well on my device.

trtexec --loadEngine=model_b1_gpu0_fp32.engine

Our .onnx file was originally a .pt model, which we converted using the DeepStream-Yolo repository, following Ultralytics’ DeepStream on NVIDIA Jetson documentation.

I have just tried it again with the command trtexec --loadEngine=vale.engine and once again got this output:

root@ip-10-224-22-91:/opt/nvidia/deepstream/deepstream-7.1/samples/models# trtexec --loadEngine=vale.engine 
&&&& RUNNING TensorRT.trtexec [TensorRT v100300] # trtexec --loadEngine=vale.engine
[07/28/2025-11:49:19] [I] === Model Options ===
[07/28/2025-11:49:19] [I] Format: *
[07/28/2025-11:49:19] [I] Model: 
[07/28/2025-11:49:19] [I] Output:
[07/28/2025-11:49:19] [I] 
[07/28/2025-11:49:19] [I] === System Options ===
[07/28/2025-11:49:19] [I] Device: 0
[07/28/2025-11:49:19] [I] DLACore: 
[07/28/2025-11:49:19] [I] Plugins:
[07/28/2025-11:49:19] [I] setPluginsToSerialize:
[07/28/2025-11:49:19] [I] dynamicPlugins:
[07/28/2025-11:49:19] [I] ignoreParsedPluginLibs: 0
[07/28/2025-11:49:19] [I] 
[07/28/2025-11:49:19] [I] === Inference Options ===
[07/28/2025-11:49:19] [I] Batch: Explicit
[07/28/2025-11:49:19] [I] Input inference shapes: model
[07/28/2025-11:49:19] [I] Iterations: 10
[07/28/2025-11:49:19] [I] Duration: 3s (+ 200ms warm up)
[07/28/2025-11:49:19] [I] Sleep time: 0ms
[07/28/2025-11:49:19] [I] Idle time: 0ms
[07/28/2025-11:49:19] [I] Inference Streams: 1
[07/28/2025-11:49:19] [I] ExposeDMA: Disabled
[07/28/2025-11:49:19] [I] Data transfers: Enabled
[07/28/2025-11:49:19] [I] Spin-wait: Disabled
[07/28/2025-11:49:19] [I] Multithreading: Disabled
[07/28/2025-11:49:19] [I] CUDA Graph: Disabled
[07/28/2025-11:49:19] [I] Separate profiling: Disabled
[07/28/2025-11:49:19] [I] Time Deserialize: Disabled
[07/28/2025-11:49:19] [I] Time Refit: Disabled
[07/28/2025-11:49:19] [I] NVTX verbosity: 0
[07/28/2025-11:49:19] [I] Persistent Cache Ratio: 0
[07/28/2025-11:49:19] [I] Optimization Profile Index: 0
[07/28/2025-11:49:19] [I] Weight Streaming Budget: 100.000000%
[07/28/2025-11:49:19] [I] Inputs:
[07/28/2025-11:49:19] [I] Debug Tensor Save Destinations:
[07/28/2025-11:49:19] [I] === Reporting Options ===
[07/28/2025-11:49:19] [I] Verbose: Disabled
[07/28/2025-11:49:19] [I] Averages: 10 inferences
[07/28/2025-11:49:19] [I] Percentiles: 90,95,99
[07/28/2025-11:49:19] [I] Dump refittable layers:Disabled
[07/28/2025-11:49:19] [I] Dump output: Disabled
[07/28/2025-11:49:19] [I] Profile: Disabled
[07/28/2025-11:49:19] [I] Export timing to JSON file: 
[07/28/2025-11:49:19] [I] Export output to JSON file: 
[07/28/2025-11:49:19] [I] Export profile to JSON file: 
[07/28/2025-11:49:19] [I] 
[07/28/2025-11:49:19] [I] === Device Information ===
[07/28/2025-11:49:19] [I] Available Devices: 
[07/28/2025-11:49:19] [I]   Device 0: "Tesla T4" UUID: GPU-6bfbbe3b-bf4d-3531-d2d6-7a1bf7e6e719
[07/28/2025-11:49:19] [I] Selected Device: Tesla T4
[07/28/2025-11:49:19] [I] Selected Device ID: 0
[07/28/2025-11:49:19] [I] Selected Device UUID: GPU-6bfbbe3b-bf4d-3531-d2d6-7a1bf7e6e719
[07/28/2025-11:49:19] [I] Compute Capability: 7.5
[07/28/2025-11:49:19] [I] SMs: 40
[07/28/2025-11:49:19] [I] Device Global Memory: 14930 MiB
[07/28/2025-11:49:19] [I] Shared Memory per SM: 64 KiB
[07/28/2025-11:49:19] [I] Memory Bus Width: 256 bits (ECC enabled)
[07/28/2025-11:49:19] [I] Application Compute Clock Rate: 1.59 GHz
[07/28/2025-11:49:19] [I] Application Memory Clock Rate: 5.001 GHz
[07/28/2025-11:49:19] [I] 
[07/28/2025-11:49:19] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[07/28/2025-11:49:19] [I] 
[07/28/2025-11:49:19] [I] TensorRT version: 10.3.0
[07/28/2025-11:49:19] [I] Loading standard plugins
LLVM ERROR: out of memory
Aborted (core dumped)

Hello!
Here are also all the commands we followed to generate the .onnx file:

cd ~
git clone https://github.com/ultralytics/ultralytics
cd ultralytics
pip3 install -e ".[export]" onnxslim
cd ~
git clone https://github.com/marcoslucianops/DeepStream-Yolo
cp ~/DeepStream-Yolo/utils/export_yolo11.py ~/ultralytics
cp vale.pt ultralytics
cd ultralytics
pip3 install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0
python3 export_yolo11.py -w vale.pt
cp vale.pt.onnx labels.txt ~/DeepStream-Yolo
cd ~/DeepStream-Yolo
export CUDA_VER=12.6
make -C nvdsinfer_custom_impl_Yolo clean && make -C nvdsinfer_custom_impl_Yolo
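
As a quick sanity check on what the export produced (a hedged sketch; it assumes the onnx package is installed and the vale.pt.onnx file name from the steps above), the input and output tensor shapes can be printed like this:

import onnx

# Print the input/output tensor shapes of the exported model so the static
# batch size (1) and the 640x640 resolution can be confirmed before the
# TensorRT engine is built.
model = onnx.load("vale.pt.onnx")
for tensor in list(model.graph.input) + list(model.graph.output):
    dims = [d.dim_value if d.dim_value else d.dim_param
            for d in tensor.type.tensor_type.shape.dim]
    print(tensor.name, dims)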

My two cents on this matter:

Based on your performance drop from 600 FPS to 40 FPS, this is actually expected when comparing these different architectures.

Key differences between TrafficCamNet and YOLOv11:

  • TrafficCamNet uses a much lighter architecture (ResNet-18) with built-in pruning and Quantization Aware Training (QAT) through TAO toolkit
  • YOLOv11 native models lack quantization and come in various architectures (n/s/m/l/x variants) with different complexity levels

Performance context: The YOLOv11 variants range from Nano (fastest, edge-optimized) to Extra-Large (highest accuracy, most computationally demanding). Even the Nano variant will be significantly slower than DetectNet_V2/TrafficCamNet due to architectural differences.

For reference, YOLOv7 with NVIDIA’s QAT typically offers the best YOLO performance, while newer versions (v9, v10, v11) generally have worse performance characteristics.

Recommendation: Rather than comparing these fundamentally different architectures, consider that DetectNet_V2 with quantized ResNet-18 is inherently much faster than YOLOv11. If you need YOLO-family performance, consider YOLOv7 with quantization, but expect it to still be slower than your original TAO model.
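
One practical knob related to the quantization point above: if the engine is currently being built in FP32, switching nvinfer to FP16 on the T4 usually cuts compute time noticeably. A hedged example of the relevant keys in config.txt (values are illustrative; INT8 would additionally need a calibration file):

[property]
# 0=FP32, 1=INT8, 2=FP16
network-mode=2
batch-size=1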

Check these links.

@Levi_Pereira thanks for your valuable suggestions. Our initial analysis also indicates that the cause is related to the model architecture. Therefore, we need to run the performance test using trtexec.

@rodrsouza I used the same command as you did, but I was running it on an A40 in Docker. Theoretically, since you were able to run the engine successfully with DeepStream, trtexec should also be fine. You can monitor the load on your host while trtexec is running. If it’s still not working, please refer to our FAQ and measure the performance using the latency method.
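
For example, running something like nvidia-smi dmon -s um in a second terminal while trtexec executes shows per-second GPU utilization and memory use (the exact columns can vary with the driver version).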

We successfully ran the test with the trtexec command. Here is the complete output:
&&&& RUNNING TensorRT.trtexec [TensorRT v100300] # trtexec --loadEngine=deepstream/model/vale.engine
[07/29/2025-18:57:55] [I] === Model Options ===
[07/29/2025-18:57:55] [I] Format: *
[07/29/2025-18:57:55] [I] Model:
[07/29/2025-18:57:55] [I] Output:
[07/29/2025-18:57:55] [I]
[07/29/2025-18:57:55] [I] === System Options ===
[07/29/2025-18:57:55] [I] Device: 0
[07/29/2025-18:57:55] [I] DLACore:
[07/29/2025-18:57:55] [I] Plugins:
[07/29/2025-18:57:55] [I] setPluginsToSerialize:
[07/29/2025-18:57:55] [I] dynamicPlugins:
[07/29/2025-18:57:55] [I] ignoreParsedPluginLibs: 0
[07/29/2025-18:57:55] [I]
[07/29/2025-18:57:55] [I] === Inference Options ===
[07/29/2025-18:57:55] [I] Batch: Explicit
[07/29/2025-18:57:55] [I] Input inference shapes: model
[07/29/2025-18:57:55] [I] Iterations: 10
[07/29/2025-18:57:55] [I] Duration: 3s (+ 200ms warm up)
[07/29/2025-18:57:55] [I] Sleep time: 0ms
[07/29/2025-18:57:55] [I] Idle time: 0ms
[07/29/2025-18:57:55] [I] Inference Streams: 1
[07/29/2025-18:57:55] [I] ExposeDMA: Disabled
[07/29/2025-18:57:55] [I] Data transfers: Enabled
[07/29/2025-18:57:55] [I] Spin-wait: Disabled
[07/29/2025-18:57:55] [I] Multithreading: Disabled
[07/29/2025-18:57:55] [I] CUDA Graph: Disabled
[07/29/2025-18:57:55] [I] Separate profiling: Disabled
[07/29/2025-18:57:55] [I] Time Deserialize: Disabled
[07/29/2025-18:57:55] [I] Time Refit: Disabled
[07/29/2025-18:57:55] [I] NVTX verbosity: 0
[07/29/2025-18:57:55] [I] Persistent Cache Ratio: 0
[07/29/2025-18:57:55] [I] Optimization Profile Index: 0
[07/29/2025-18:57:55] [I] Weight Streaming Budget: 100.000000%
[07/29/2025-18:57:55] [I] Inputs:
[07/29/2025-18:57:55] [I] Debug Tensor Save Destinations:
[07/29/2025-18:57:55] [I] === Reporting Options ===
[07/29/2025-18:57:55] [I] Verbose: Disabled
[07/29/2025-18:57:55] [I] Averages: 10 inferences
[07/29/2025-18:57:55] [I] Percentiles: 90,95,99
[07/29/2025-18:57:55] [I] Dump refittable layers:Disabled
[07/29/2025-18:57:55] [I] Dump output: Disabled
[07/29/2025-18:57:55] [I] Profile: Disabled
[07/29/2025-18:57:55] [I] Export timing to JSON file:
[07/29/2025-18:57:55] [I] Export output to JSON file:
[07/29/2025-18:57:55] [I] Export profile to JSON file:
[07/29/2025-18:57:55] [I]
[07/29/2025-18:57:55] [I] === Device Information ===
[07/29/2025-18:57:55] [I] Available Devices:
[07/29/2025-18:57:55] [I]   Device 0: "Tesla T4" UUID: GPU-e0a0c54c-f462-fe69-20f6-f33c1f9bfe6e
[07/29/2025-18:57:56] [I] Selected Device: Tesla T4
[07/29/2025-18:57:56] [I] Selected Device ID: 0
[07/29/2025-18:57:56] [I] Selected Device UUID: GPU-e0a0c54c-f462-fe69-20f6-f33c1f9bfe6e
[07/29/2025-18:57:56] [I] Compute Capability: 7.5
[07/29/2025-18:57:56] [I] SMs: 40
[07/29/2025-18:57:56] [I] Device Global Memory: 15935 MiB
[07/29/2025-18:57:56] [I] Shared Memory per SM: 64 KiB
[07/29/2025-18:57:56] [I] Memory Bus Width: 256 bits (ECC disabled)
[07/29/2025-18:57:56] [I] Application Compute Clock Rate: 1.59 GHz
[07/29/2025-18:57:56] [I] Application Memory Clock Rate: 5.001 GHz
[07/29/2025-18:57:56] [I]
[07/29/2025-18:57:56] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[07/29/2025-18:57:56] [I]
[07/29/2025-18:57:56] [I] TensorRT version: 10.3.0
[07/29/2025-18:57:56] [I] Loading standard plugins
[07/29/2025-18:57:56] [I] [TRT] Loaded engine size: 112 MiB
[07/29/2025-18:57:56] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[07/29/2025-18:57:56] [I] Engine deserialized in 0.0765873 sec.
[07/29/2025-18:57:56] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +77, now: CPU 0, GPU 189 (MiB)
[07/29/2025-18:57:56] [I] Setting persistentCacheLimit to 0 bytes.
[07/29/2025-18:57:56] [I] Created execution context with device memory size: 74.2188 MiB
[07/29/2025-18:57:56] [I] Using random values for input input
[07/29/2025-18:57:56] [I] Input binding for input with dimensions 1x3x640x640 is created.
[07/29/2025-18:57:56] [I] Output binding for output with dimensions 1x8400x6 is created.
[07/29/2025-18:57:56] [I] Starting inference
[07/29/2025-18:57:59] [I] Warmup completed 8 queries over 200 ms
[07/29/2025-18:57:59] [I] Timing trace has 130 queries over 3.05927 s
[07/29/2025-18:57:59] [I]
[07/29/2025-18:57:59] [I] === Trace details ===
[07/29/2025-18:57:59] [I] Trace averages of 10 runs:
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.0918 ms - Host latency: 23.5008 ms (enqueue 1.69229 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.1935 ms - Host latency: 23.6024 ms (enqueue 1.68199 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.3682 ms - Host latency: 23.7773 ms (enqueue 1.65952 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.9427 ms - Host latency: 24.352 ms (enqueue 1.65334 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.0424 ms - Host latency: 23.4516 ms (enqueue 1.64347 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.2945 ms - Host latency: 23.7028 ms (enqueue 1.65194 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.1616 ms - Host latency: 23.5696 ms (enqueue 1.62484 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.7048 ms - Host latency: 24.1139 ms (enqueue 1.63248 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.2862 ms - Host latency: 23.6951 ms (enqueue 1.65913 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.2471 ms - Host latency: 23.6559 ms (enqueue 1.65354 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.2984 ms - Host latency: 23.7068 ms (enqueue 1.65906 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.5356 ms - Host latency: 23.944 ms (enqueue 1.679 ms)
[07/29/2025-18:57:59] [I] Average on 10 runs - GPU latency: 23.3919 ms - Host latency: 23.7993 ms (enqueue 1.63882 ms)
[07/29/2025-18:57:59] [I]
[07/29/2025-18:57:59] [I] === Performance summary ===
[07/29/2025-18:57:59] [I] Throughput: 42.4937 qps
[07/29/2025-18:57:59] [I] Latency: min = 23.2473 ms, max = 25.9184 ms, mean = 23.7593 ms, median = 23.697 ms, percentile(90%) = 24.1882 ms, percentile(95%) = 24.3213 ms, percentile(99%) = 25.8605 ms
[07/29/2025-18:57:59] [I] Enqueue Time: min = 1.60962 ms, max = 1.89923 ms, mean = 1.65611 ms, median = 1.6371 ms, percentile(90%) = 1.72449 ms, percentile(95%) = 1.84314 ms, percentile(99%) = 1.89624 ms
[07/29/2025-18:57:59] [I] H2D Latency: min = 0.383667 ms, max = 0.389648 ms, mean = 0.385992 ms, median = 0.385986 ms, percentile(90%) = 0.388062 ms, percentile(95%) = 0.388428 ms, percentile(99%) = 0.389404 ms
[07/29/2025-18:57:59] [I] GPU Compute Time: min = 22.8389 ms, max = 25.5097 ms, mean = 23.3507 ms, median = 23.2883 ms, percentile(90%) = 23.7808 ms, percentile(95%) = 23.9138 ms, percentile(99%) = 25.4509 ms
[07/29/2025-18:57:59] [I] D2H Latency: min = 0.0195312 ms, max = 0.0248413 ms, mean = 0.0226805 ms, median = 0.0227051 ms, percentile(90%) = 0.0239258 ms, percentile(95%) = 0.0241699 ms, percentile(99%) = 0.0245361 ms
[07/29/2025-18:57:59] [I] Total Host Walltime: 3.05927 s
[07/29/2025-18:57:59] [I] Total GPU Compute Time: 3.03559 s
[07/29/2025-18:57:59] [W] * GPU compute time is unstable, with coefficient of variance = 1.87985%.
[07/29/2025-18:57:59] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/29/2025-18:57:59] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/29/2025-18:57:59] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100300] # trtexec --loadEngine=deepstream/model/vale.engine

You can check that in your model’s perf log. The latency is about 23 ms to 25 ms per frame, which works out to roughly 1 / 0.0235 s ≈ 42 inferences per second and matches both the reported throughput of 42.49 qps and the ~40 FPS you see in the pipeline. The drop in FPS is caused by the model you are using.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.