DeepStream App Performance Drop with Custom YOLOv11 Model

We developed a simple Python application using DeepStream, based on the deepstream_test_1 example from the deepstream_python_apps repository. It worked flawlessly with the TrafficCamNet model (TrafficCamNet | NVIDIA NGC), delivering excellent performance at approximately 600 FPS.

We then modified the application to process video using our custom YOLOv11 model, following the guidance provided in this documentation: Ultralytics YOLO11 on NVIDIA Jetson using DeepStream SDK and TensorRT.

While the integration was successful, we observed a significant drop in performance—from 600 FPS down to around 40 FPS.

We would like to optimize our application to achieve better performance with our custom model. Below are our configuration files, pipeline setup, and hardware specs. For testing, we are using an .mp4 video file.

deepstream-app version 7.1.0
DeepStreamSDK 7.1.0
CUDA Driver Version: 12.6
CUDA Runtime Version: 12.6
TensorRT Version: 10.3
cuDNN Version: 9.0
libNVWarp360 Version: 2.0.1d3
GPU: Tesla T4 - 16 GB

pipeline = Gst.parse_launch(
    f"filesrc location={file_path} ! qtdemux name=demux "
    "demux.video_0 ! h264parse ! nvv4l2decoder ! nvvideoconvert ! "
    f"video/x-raw(memory:NVMM),format=NV12,width={width},height={height} ! "
    "queue ! mux.sink_0 "
    f"nvstreammux name=mux batch-size=1 width={width} height={height} live-source=0 batched-push-timeout=40000 ! "
    "nvinfer config-file-path=/opt/nvidia/deepstream/deepstream/sources/vale/config.txt ! "
    "nvdsosd name=osd ! nvvideoconvert ! "
    "video/x-raw(memory:NVMM),format=NV12 ! "
    "nvv4l2h264enc ! h264parse ! qtmux ! "
    "filesink location=output.mp4"
)

# Buffer probe on the OSD sink pad so process_video can read the inference metadata
osd = pipeline.get_by_name("osd")
if osd:
    sinkpad = osd.get_static_pad("sink")
    if sinkpad:
        sinkpad.add_probe(Gst.PadProbeType.BUFFER, process_video, 0)
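
The process_video callback is not shown above; it follows the usual metadata-walking pattern from deepstream_test_1. A minimal sketch is below, assuming the standard gi/pyds imports; the per-frame counting and printing are illustrative rather than our exact logic.

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst
import pyds

def process_video(pad, info, u_data):
    # Walk the batch metadata that nvinfer attached to this buffer
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK

    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        num_objects = 0
        l_obj = frame_meta.obj_meta_list
        while l_obj is not None:
            obj_meta = pyds.NvDsObjectMeta.cast(l_obj.data)  # one detection
            num_objects += 1
            try:
                l_obj = l_obj.next
            except StopIteration:
                break
        print(f"frame {frame_meta.frame_num}: {num_objects} objects")
        try:
            l_frame = l_frame.next
        except StopIteration:
            break
    return Gst.PadProbeReturn.OK

Whatever runs in this callback executes on the GStreamer streaming thread, so heavy per-frame Python work here shows up directly in the measured FPS and is worth timing separately from the model itself.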

config_custom_model.txt (639 Bytes)
config_traffic_cam.txt (592 Bytes)
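
The attached files are not reproduced inline. For reference, an nvinfer config for a YOLO model exported per the Ultralytics/DeepStream-Yolo guide typically looks roughly like the sketch below; the file names, class count, and thresholds are placeholders rather than our actual values. network-mode (0 = FP32, 1 = INT8, 2 = FP16) and batch-size are the keys that matter most for throughput.

[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
model-color-format=0
# placeholder model/label files
onnx-file=yolo11.onnx
model-engine-file=model_b1_gpu0_fp16.engine
labelfile-path=labels.txt
batch-size=1
# 0=FP32, 1=INT8, 2=FP16
network-mode=2
num-detected-classes=80
interval=0
gie-unique-id=1
cluster-mode=2
maintain-aspect-ratio=1
# custom YOLO output parser shipped with the DeepStream-Yolo project
parse-bbox-func-name=NvDsInferParseYolo
custom-lib-path=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so

[class-attrs-all]
nms-iou-threshold=0.45
pre-cluster-threshold=0.25
topk=300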


Could you run the command below to check the perf of your model first?

trtexec --loadEngine=vale.engine --batch=1 --fp16

Hi there!

I’m working together with rodrsouza on this. I tried to run the command, and at first I got the following error due to the --batch=1 option:

[07/25/2025-13:40:28] [E] Unknown option: --batch 1
&&&& FAILED TensorRT.trtexec [TensorRT v100300] # trtexec --loadEngine=vale.engine --batch=1 --fp16

But when I ran trtexec --loadEngine=vale.engine --fp16 without the batch flag, I got the following output:

&&&& RUNNING TensorRT.trtexec [TensorRT v100300] # trtexec --loadEngine=vale.engine --fp16
[07/25/2025-13:22:10] [I] === Model Options ===
[07/25/2025-13:22:10] [I] Format: *
[07/25/2025-13:22:10] [I] Model: 
[07/25/2025-13:22:10] [I] Output:
[07/25/2025-13:22:10] [I] 
[07/25/2025-13:22:10] [I] === System Options ===
[07/25/2025-13:22:10] [I] Device: 0
[07/25/2025-13:22:10] [I] DLACore: 
[07/25/2025-13:22:10] [I] Plugins:
[07/25/2025-13:22:10] [I] setPluginsToSerialize:
[07/25/2025-13:22:10] [I] dynamicPlugins:
[07/25/2025-13:22:10] [I] ignoreParsedPluginLibs: 0
[07/25/2025-13:22:10] [I] 
[07/25/2025-13:22:10] [I] === Inference Options ===
[07/25/2025-13:22:10] [I] Batch: Explicit
[07/25/2025-13:22:10] [I] Input inference shapes: model
[07/25/2025-13:22:10] [I] Iterations: 10
[07/25/2025-13:22:10] [I] Duration: 3s (+ 200ms warm up)
[07/25/2025-13:22:10] [I] Sleep time: 0ms
[07/25/2025-13:22:10] [I] Idle time: 0ms
[07/25/2025-13:22:10] [I] Inference Streams: 1
[07/25/2025-13:22:10] [I] ExposeDMA: Disabled
[07/25/2025-13:22:10] [I] Data transfers: Enabled
[07/25/2025-13:22:10] [I] Spin-wait: Disabled
[07/25/2025-13:22:10] [I] Multithreading: Disabled
[07/25/2025-13:22:10] [I] CUDA Graph: Disabled
[07/25/2025-13:22:10] [I] Separate profiling: Disabled
[07/25/2025-13:22:10] [I] Time Deserialize: Disabled
[07/25/2025-13:22:10] [I] Time Refit: Disabled
[07/25/2025-13:22:10] [I] NVTX verbosity: 0
[07/25/2025-13:22:10] [I] Persistent Cache Ratio: 0
[07/25/2025-13:22:10] [I] Optimization Profile Index: 0
[07/25/2025-13:22:10] [I] Weight Streaming Budget: 100.000000%
[07/25/2025-13:22:10] [I] Inputs:
[07/25/2025-13:22:10] [I] Debug Tensor Save Destinations:
[07/25/2025-13:22:10] [I] === Reporting Options ===
[07/25/2025-13:22:10] [I] Verbose: Disabled
[07/25/2025-13:22:10] [I] Averages: 10 inferences
[07/25/2025-13:22:10] [I] Percentiles: 90,95,99
[07/25/2025-13:22:10] [I] Dump refittable layers:Disabled
[07/25/2025-13:22:10] [I] Dump output: Disabled
[07/25/2025-13:22:10] [I] Profile: Disabled
[07/25/2025-13:22:10] [I] Export timing to JSON file: 
[07/25/2025-13:22:10] [I] Export output to JSON file: 
[07/25/2025-13:22:10] [I] Export profile to JSON file: 
[07/25/2025-13:22:10] [I] 
[07/25/2025-13:22:10] [I] === Device Information ===
[07/25/2025-13:22:11] [I] Available Devices: 
[07/25/2025-13:22:11] [I]   Device 0: "Tesla T4" UUID: GPU-bbd80375-ec3c-7309-6621-4f2970f1d139
[07/25/2025-13:22:11] [I] Selected Device: Tesla T4
[07/25/2025-13:22:11] [I] Selected Device ID: 0
[07/25/2025-13:22:11] [I] Selected Device UUID: GPU-bbd80375-ec3c-7309-6621-4f2970f1d139
[07/25/2025-13:22:11] [I] Compute Capability: 7.5
[07/25/2025-13:22:11] [I] SMs: 40
[07/25/2025-13:22:11] [I] Device Global Memory: 14930 MiB
[07/25/2025-13:22:11] [I] Shared Memory per SM: 64 KiB
[07/25/2025-13:22:11] [I] Memory Bus Width: 256 bits (ECC enabled)
[07/25/2025-13:22:11] [I] Application Compute Clock Rate: 1.59 GHz
[07/25/2025-13:22:11] [I] Application Memory Clock Rate: 5.001 GHz
[07/25/2025-13:22:11] [I] 
[07/25/2025-13:22:11] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[07/25/2025-13:22:11] [I] 
[07/25/2025-13:22:11] [I] TensorRT version: 10.3.0
[07/25/2025-13:22:11] [I] Loading standard plugins
LLVM ERROR: out of memory