DeepStream App Performance Drop with Custom YOLOv11 Model

We developed a simple Python application using DeepStream, based on the deepstream_test_1 example from the deepstream_python_apps repository. It worked flawlessly with the TrafficCamNet model (TrafficCamNet | NVIDIA NGC), delivering excellent performance at approximately 600 FPS.

We then modified the application to process video using our custom YOLOv11 model, following the guidance provided in this documentation: Ultralytics YOLO11 on NVIDIA Jetson using DeepStream SDK and TensorRT.

While the integration was successful, we observed a significant drop in performance—from 600 FPS down to around 40 FPS.

We would like to optimize our application to achieve better performance with our custom model. Below are our configuration files, pipeline setup, and hardware specs. For testing, we are using an .mp4 video file.

deepstream-app version 7.1.0
DeepStreamSDK 7.1.0
CUDA Driver Version: 12.6
CUDA Runtime Version: 12.6
TensorRT Version: 10.3
cuDNN Version: 9.0
libNVWarp360 Version: 2.0.1d3
GPU: Tesla T4 - 16 GB

pipeline = Gst.parse_launch(
    f"filesrc location={file_path} ! qtdemux name=demux "
    "demux.video_0 ! h264parse ! nvv4l2decoder ! nvvideoconvert ! "
    f"video/x-raw(memory:NVMM),format=NV12,width={width},height={height} ! "
    "queue ! mux.sink_0 "
    f"nvstreammux name=mux batch-size=1 width={width} height={height} live-source=0 batched-push-timeout=40000 ! "
    "nvinfer config-file-path=/opt/nvidia/deepstream/deepstream/sources/vale/config.txt ! "
    "nvdsosd name=osd ! nvvideoconvert ! "
    "video/x-raw(memory:NVMM),format=NV12 ! "
    "nvv4l2h264enc ! h264parse ! qtmux ! "
    "filesink location=output.mp4"
)

# Buffer probe on the OSD sink pad so process_video can read the inference metadata
osd = pipeline.get_by_name("osd")
if osd:
    sinkpad = osd.get_static_pad("sink")
    if sinkpad:
        sinkpad.add_probe(Gst.PadProbeType.BUFFER, process_video, 0)
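
The process_video callback is not shown above; it follows the usual metadata-walking pattern from deepstream_test_1. A minimal sketch is below, assuming the standard gi/pyds imports; the per-frame counting and printing are illustrative rather than our exact logic.

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst
import pyds

def process_video(pad, info, u_data):
    # Walk the batch metadata that nvinfer attached to this buffer
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK

    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        num_objects = 0
        l_obj = frame_meta.obj_meta_list
        while l_obj is not None:
            obj_meta = pyds.NvDsObjectMeta.cast(l_obj.data)  # one detection
            num_objects += 1
            try:
                l_obj = l_obj.next
            except StopIteration:
                break
        print(f"frame {frame_meta.frame_num}: {num_objects} objects")
        try:
            l_frame = l_frame.next
        except StopIteration:
            break
    return Gst.PadProbeReturn.OK

Whatever runs in this callback executes on the GStreamer streaming thread, so heavy per-frame Python work here shows up directly in the measured FPS and is worth timing separately from the model itself.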

config_custom_model.txt (639 Bytes)
config_traffic_cam.txt (592 Bytes)
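
The attached files are not reproduced inline. For reference, an nvinfer config for a YOLO model exported per the Ultralytics/DeepStream-Yolo guide typically looks roughly like the sketch below; the file names, class count, and thresholds are placeholders rather than our actual values. network-mode (0 = FP32, 1 = INT8, 2 = FP16) and batch-size are the keys that matter most for throughput.

[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
model-color-format=0
# placeholder model/label files
onnx-file=yolo11.onnx
model-engine-file=model_b1_gpu0_fp16.engine
labelfile-path=labels.txt
batch-size=1
# 0=FP32, 1=INT8, 2=FP16
network-mode=2
num-detected-classes=80
interval=0
gie-unique-id=1
cluster-mode=2
maintain-aspect-ratio=1
# custom YOLO output parser shipped with the DeepStream-Yolo project
parse-bbox-func-name=NvDsInferParseYolo
custom-lib-path=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so

[class-attrs-all]
nms-iou-threshold=0.45
pre-cluster-threshold=0.25
topk=300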


Could you run the command below to check the perf of your model first?

trtexec --loadEngine=vale.engine --batch=1 --fp16

Hi there!

I’m working together with rodrsouza on this. I tried to run the command, and at first I got the following error due to the --batch=1 option:

[07/25/2025-13:40:28] [E] Unknown option: --batch 1
&&&& FAILED TensorRT.trtexec [TensorRT v100300] # trtexec --loadEngine=vale.engine --batch=1 --fp16

But when I ran trtexec --loadEngine=vale.engine --fp16 without the batch flag, I got the following output:

&&&& RUNNING TensorRT.trtexec [TensorRT v100300] # trtexec --loadEngine=vale.engine --fp16
[07/25/2025-13:22:10] [I] === Model Options ===
[07/25/2025-13:22:10] [I] Format: *
[07/25/2025-13:22:10] [I] Model: 
[07/25/2025-13:22:10] [I] Output:
[07/25/2025-13:22:10] [I] 
[07/25/2025-13:22:10] [I] === System Options ===
[07/25/2025-13:22:10] [I] Device: 0
[07/25/2025-13:22:10] [I] DLACore: 
[07/25/2025-13:22:10] [I] Plugins:
[07/25/2025-13:22:10] [I] setPluginsToSerialize:
[07/25/2025-13:22:10] [I] dynamicPlugins:
[07/25/2025-13:22:10] [I] ignoreParsedPluginLibs: 0
[07/25/2025-13:22:10] [I] 
[07/25/2025-13:22:10] [I] === Inference Options ===
[07/25/2025-13:22:10] [I] Batch: Explicit
[07/25/2025-13:22:10] [I] Input inference shapes: model
[07/25/2025-13:22:10] [I] Iterations: 10
[07/25/2025-13:22:10] [I] Duration: 3s (+ 200ms warm up)
[07/25/2025-13:22:10] [I] Sleep time: 0ms
[07/25/2025-13:22:10] [I] Idle time: 0ms
[07/25/2025-13:22:10] [I] Inference Streams: 1
[07/25/2025-13:22:10] [I] ExposeDMA: Disabled
[07/25/2025-13:22:10] [I] Data transfers: Enabled
[07/25/2025-13:22:10] [I] Spin-wait: Disabled
[07/25/2025-13:22:10] [I] Multithreading: Disabled
[07/25/2025-13:22:10] [I] CUDA Graph: Disabled
[07/25/2025-13:22:10] [I] Separate profiling: Disabled
[07/25/2025-13:22:10] [I] Time Deserialize: Disabled
[07/25/2025-13:22:10] [I] Time Refit: Disabled
[07/25/2025-13:22:10] [I] NVTX verbosity: 0
[07/25/2025-13:22:10] [I] Persistent Cache Ratio: 0
[07/25/2025-13:22:10] [I] Optimization Profile Index: 0
[07/25/2025-13:22:10] [I] Weight Streaming Budget: 100.000000%
[07/25/2025-13:22:10] [I] Inputs:
[07/25/2025-13:22:10] [I] Debug Tensor Save Destinations:
[07/25/2025-13:22:10] [I] === Reporting Options ===
[07/25/2025-13:22:10] [I] Verbose: Disabled
[07/25/2025-13:22:10] [I] Averages: 10 inferences
[07/25/2025-13:22:10] [I] Percentiles: 90,95,99
[07/25/2025-13:22:10] [I] Dump refittable layers:Disabled
[07/25/2025-13:22:10] [I] Dump output: Disabled
[07/25/2025-13:22:10] [I] Profile: Disabled
[07/25/2025-13:22:10] [I] Export timing to JSON file: 
[07/25/2025-13:22:10] [I] Export output to JSON file: 
[07/25/2025-13:22:10] [I] Export profile to JSON file: 
[07/25/2025-13:22:10] [I] 
[07/25/2025-13:22:10] [I] === Device Information ===
[07/25/2025-13:22:11] [I] Available Devices: 
[07/25/2025-13:22:11] [I]   Device 0: "Tesla T4" UUID: GPU-bbd80375-ec3c-7309-6621-4f2970f1d139
[07/25/2025-13:22:11] [I] Selected Device: Tesla T4
[07/25/2025-13:22:11] [I] Selected Device ID: 0
[07/25/2025-13:22:11] [I] Selected Device UUID: GPU-bbd80375-ec3c-7309-6621-4f2970f1d139
[07/25/2025-13:22:11] [I] Compute Capability: 7.5
[07/25/2025-13:22:11] [I] SMs: 40
[07/25/2025-13:22:11] [I] Device Global Memory: 14930 MiB
[07/25/2025-13:22:11] [I] Shared Memory per SM: 64 KiB
[07/25/2025-13:22:11] [I] Memory Bus Width: 256 bits (ECC enabled)
[07/25/2025-13:22:11] [I] Application Compute Clock Rate: 1.59 GHz
[07/25/2025-13:22:11] [I] Application Memory Clock Rate: 5.001 GHz
[07/25/2025-13:22:11] [I] 
[07/25/2025-13:22:11] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[07/25/2025-13:22:11] [I] 
[07/25/2025-13:22:11] [I] TensorRT version: 10.3.0
[07/25/2025-13:22:11] [I] Loading standard plugins
LLVM ERROR: out of memory