Huge delay difference between Python script and deepstream-app

Please provide complete information as applicable to your setup.
• Hardware Platform (Jetson / GPU) Jetson Xavier AGX
• DeepStream Version 6.0.1
• JetPack Version (valid for Jetson only) 4.6
• TensorRT Version 8.2.1

I am seeing a large difference in delay between deepstream-app and the sample from deepstream_python_apps/apps/deepstream-test3/ at master · NVIDIA-AI-IOT/deepstream_python_apps when both consume the same RTSP stream, although the processing frame rate is almost the same.

The Python script takes almost 15 seconds to reach print("Starting pipeline \n") at line 388. This elapsed time looks the same as the delay I am seeing, as if a buffer were holding old frames and handing them out in FIFO order.

Does anyone have a tip on how to avoid this delay?

Do you mean that model conversion takes time? Conversion only occurs the first time you run the sample.

No, I don’t. The delay occurs at inference time.

Using deepstream-app, the video frame shown with the inference bounding box is (almost) the same frame the RTSP stream is producing at that moment, that is:
Tdeepstream-app ~ Tstream (all running at 24fps)

But when using the Python script, the video frame shown with the inference bounding box is 15 seconds behind the frame produced by the RTSP stream, that is:
Tpython = Tstream + 15s (all running at 24fps)

And this 15s seems to be the time the script takes to reach line 388 (print("Starting pipeline \n")), which might be just a coincidence.

I’m not sure how you measured it. I think the patch below is the correct way to measure it:

t0 == start time
t1 == pipeline start success time
t2 == First frame output time

diff --git a/apps/deepstream-test3/ b/apps/deepstream-test3/
index 75a64d5..6a0538f 100755
--- a/apps/deepstream-test3/
+++ b/apps/deepstream-test3/
@@ -41,6 +41,8 @@ silent = False
 file_loop = False
 perf_data = None

+time_measure = False
@@ -105,6 +107,11 @@ def pgie_src_pad_buffer_probe(pad,info,u_data):
         if not silent:
             print("Frame Number=", frame_number, "Number of Objects=",num_rects,"Vehicle_count=",obj_counter[PGIE_CLASS_ID_VEHICLE],"Person_count=",obj_counter[PGIE_CLASS_ID_PERSON])

+        global time_measure
+        if not time_measure:
+            time_measure = True
+            print(f"t2 {time.time()}")
         # Update frame rate through this probe
         stream_index = "stream{0}".format(frame_meta.pad_index)
         global perf_data
@@ -210,7 +217,7 @@ def main(args, requested_pgie=None, config=None, disable_probe=False):

     # Create gstreamer elements */
     # Create Pipeline element that will form a connection of other elements
-    print("Creating Pipeline \n ")
+    print(f"Creating Pipeline t0 {time.time()}\n ")
     pipeline = Gst.Pipeline()
     is_live = False

@@ -301,7 +308,8 @@ def main(args, requested_pgie=None, config=None, disable_probe=False):
         if is_aarch64():
             print("Creating nv3dsink \n")
-            sink = Gst.ElementFactory.make("nv3dsink", "nv3d-sink")
+            # sink = Gst.ElementFactory.make("nv3dsink", "nv3d-sink")
+            sink = Gst.ElementFactory.make("fakesink", "nv3d-sink")
             if not sink:
                 sys.stderr.write(" Unable to create nv3dsink \n")
@@ -385,7 +393,7 @@ def main(args, requested_pgie=None, config=None, disable_probe=False):
     for i, source in enumerate(args):
         print(i, ": ", source)

-    print("Starting pipeline \n")
+    print(f"Starting pipeline t1 {time.time()}\n")
     # start play back and listed to events

Since you use RTSP as input, can you ensure that the input video sequences of deepstream-app and the Python script are exactly the same? What is the IDR interval of your RTSP stream?

These are the measurements I took:
Creating Pipeline t0 1708000703.2639432
Starting pipeline t1 1708000703.4469986
t2 1708000716.5142927

As you can see, the time between starting the pipeline (t1) and the first buffer probe call (t2) is about 13s. However, I think the pipeline can take as much time as it needs to load (1s, 13s, or even 1 hour), as long as the inference then reacts immediately to an event. For instance, if I raise my hand in front of the camera, the pipeline must deliver that image to the buffer probe almost immediately. It can’t take 13s between my raising my hand and the output image showing my hand raised.
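For reference, the intervals can be worked out directly from the three timestamps logged above. This is only arithmetic on the posted numbers; the 24 fps figure is the frame rate mentioned earlier in the thread, and the variable names are just for illustration.

```python
# Timestamps copied from the log output above (seconds since the epoch).
t0 = 1708000703.2639432  # "Creating Pipeline"
t1 = 1708000703.4469986  # "Starting pipeline"
t2 = 1708000716.5142927  # first call of the buffer probe

FPS = 24  # stream frame rate reported earlier in the thread

setup_s = t1 - t0        # time spent building the pipeline
first_frame_s = t2 - t1  # time from the start request to the first probed frame
frames_in_gap = round(first_frame_s * FPS)  # frames the camera emitted meanwhile

print(f"pipeline setup:     {setup_s:.3f} s")
print(f"start -> 1st frame: {first_frame_s:.3f} s")
print(f"frames emitted by the camera in that gap: {frames_in_gap}")
```

So building the pipeline takes well under a second, and nearly all of the roughly 13s falls between the start request and the first frame reaching the probe.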

It seems like there is some kind of buffer that starts receiving images from the stream as soon as the pipeline start is triggered, while the pipeline is not consuming those images yet. When the pipeline starts consuming and running inference, it gets the oldest image first, just like a FIFO data structure. And this FIFO holds about 13s of images.
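The FIFO hypothesis can be illustrated with a toy simulation. This is not DeepStream code and says nothing about how any GStreamer element is implemented; it only shows that if frames queue up during a 13s startup and are then consumed oldest-first at the camera's frame rate, the lag never shrinks, whereas discarding stale frames removes it. The function name and parameters are made up for this sketch.

```python
from collections import deque

def simulate_latency(fps, startup_s, run_s, drop_stale):
    """Return the age (in seconds) of each frame at the moment it is consumed.

    Toy model: the camera produces one frame per tick (1/fps seconds) from
    t=0; the consumer starts at t=startup_s and takes one frame per tick.
    """
    queue = deque()
    ages = []
    start_tick = int(startup_s * fps)
    total_ticks = int((startup_s + run_s) * fps)
    for tick in range(total_ticks):
        queue.append(tick)  # each frame is stamped with its capture tick
        if tick < start_tick:
            continue  # consumer not running yet, frames pile up
        if drop_stale:
            frame_tick = queue.pop()  # take the newest frame ...
            queue.clear()             # ... and discard everything older
        else:
            frame_tick = queue.popleft()  # strict FIFO: oldest frame first
        ages.append((tick - frame_tick) / fps)
    return ages

fifo = simulate_latency(fps=24, startup_s=13, run_s=5, drop_stale=False)
fresh = simulate_latency(fps=24, startup_s=13, run_s=5, drop_stale=True)
print(f"FIFO:       first={fifo[0]:.1f}s last={fifo[-1]:.1f}s")
print(f"drop stale: first={fresh[0]:.1f}s last={fresh[-1]:.1f}s")
```

In the FIFO case every consumed frame is 13s old, because the consumer runs exactly as fast as the producer and so never catches up. That matches the symptom of a constant delay at an unchanged frame rate.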

This time is used to load the model and run inference.
Both Python and deepstream-app need it.
It doesn’t result in a huge performance difference.

You can set the sync property of nv3dsink to false, which will output the buffer data as soon as possible.
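A minimal sketch of that suggestion follows. "sync" is a standard property of GStreamer sinks; the helper name and the RecordingElement stand-in are hypothetical and exist only so the snippet runs without GStreamer installed. In the real script you would pass the element returned by Gst.ElementFactory.make("nv3dsink", ...).

```python
class RecordingElement:
    """Stand-in for a Gst.Element so the sketch runs without GStreamer."""

    def __init__(self):
        self.props = {}

    def set_property(self, name, value):
        self.props[name] = value


def disable_clock_sync(sink):
    """Ask a sink to render buffers on arrival instead of waiting for the clock."""
    # sync=False: do not hold each buffer until its timestamp is due.
    sink.set_property("sync", False)
    return sink


demo_sink = disable_clock_sync(RecordingElement())
print(demo_sink.props)
```

Note that this only stops the sink from waiting on buffer timestamps; it does not remove frames that are already queued upstream.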

“Doesn’t result in a huge performance difference.”
I agree.

Changing the sync property to false made no difference: if I raise my hand in front of the camera, I still need to wait about 13s to see my hand raised in the output stream. I also tested two things:

  1. Using VLC, I connected directly to the network stream produced by the camera, in case the camera itself was introducing the delay, but it is not. When I raise my hand, it is seen in the stream immediately.

  2. Following your tip about sync, I found this post explaining the property on udpsink and changed the code to set it on udpsink (dstest1_pgie_nvinfer_config.txt (3.1 KB) and (19.0 KB) ), but the delay is the same.

So I decided to greatly simplify the pipeline in order to isolate the delay problem. The original pipeline was uridecodebin -> nvstreammux -> nvinfer -> nvtracker -> -> nvvideoconvert -> nvdsosd -> nvvideoconvert -> capsfilter -> nvv4l2h264enc -> rtph264pay -> udpsink. Now my pipeline is:

uridecodebin -> nvstreammux ->  nvvideoconvert -> capsfilter -> nvv4l2h264enc -> rtph264pay -> udpsink

It just takes the source stream and republishes it as another stream. The delay is still there. This is the simplified code:

#python3 -i rtsp://admin:hbyt12345@
# Creates an RTSP stream at rtsp://localhost:8554/ds-test

import sys
from common.bus_call import bus_call
from common.is_aarch_64 import is_aarch64
from common.FPS import GETFPS
import pyds
import platform
import math
import time
from ctypes import *
import gi
gi.require_version("Gst", "1.0")
gi.require_version("GstRtspServer", "1.0")
from gi.repository import Gst, GstRtspServer, GLib
import configparser
import argparse



def make_elm_or_print_err(factoryname, humanname, detail=""):
  print("Creating", humanname)
  elm = Gst.ElementFactory.make(factoryname, humanname)
  if not elm:
     sys.stderr.write("Unable to create " + humanname + " \n")
     # Print the optional extra hint only when creation failed
     if detail:
        sys.stderr.write(detail)
  return elm

def cb_newpad(decodebin, decoder_src_pad, data):
    print("In cb_newpad\n")
    caps = decoder_src_pad.get_current_caps()
    gststruct = caps.get_structure(0)
    gstname = gststruct.get_name()
    source_bin = data
    features = caps.get_features(0)

    print("gstname=", gstname)
    if gstname.find("video") != -1:
        print("features=", features)
        if features.contains("memory:NVMM"):
            # Get the source bin ghost pad
            bin_ghost_pad = source_bin.get_static_pad("src")
            if not bin_ghost_pad.set_target(decoder_src_pad):
                sys.stderr.write(
                    "Failed to link decoder src pad to source bin ghost pad\n"
                )
        else:
            sys.stderr.write(
                " Error: Decodebin did not pick nvidia decoder plugin.\n")

def decodebin_child_added(child_proxy, Object, name, user_data):
    print("Decodebin child added:", name, "\n")
    if name.find("decodebin") != -1:
        Object.connect("child-added", decodebin_child_added, user_data)

def create_source_bin(index, uri):
    bin_name = "source-bin-%02d" % index
    print("Creating source bin", bin_name, "(", uri, ")")
    nbin = Gst.Bin.new(bin_name)
    if not nbin:
        # sys.stderr.write() accepts a single string, so build the message first
        sys.stderr.write("\tUnable to create source bin " + bin_name + "\n")

    uri_decode_bin = make_elm_or_print_err("uridecodebin", "uri-decode-bin")
    uri_decode_bin.set_property("uri", uri)

    uri_decode_bin.connect("pad-added", cb_newpad, nbin)
    uri_decode_bin.connect("child-added", decodebin_child_added, nbin)

    Gst.Bin.add(nbin, uri_decode_bin)
    bin_pad = nbin.add_pad(
        Gst.GhostPad.new_no_target("src", Gst.PadDirection.SRC))
    if not bin_pad:
        sys.stderr.write("\tFailed to add ghost pad in source bin " + bin_name + "\n")
        return None
    return nbin

def main(args):
    # Check input arguments
    number_sources = len(args)

    for i in range(0,number_sources):


    print(f"Creating Pipeline\n ")
    pipeline = Gst.Pipeline()
    is_rtsp = False
    if not pipeline:
        sys.stderr.write("\tUnable to create Pipeline \n")


    # Create nvstreammux instance to form batches from one or more sources.
    streammux = make_elm_or_print_err("nvstreammux", "stream-muxer")
    streammux.set_property("width", MUXER_OUTPUT_WIDTH)
    streammux.set_property("height", MUXER_OUTPUT_HEIGHT)
    streammux.set_property("batch-size", 1)
    streammux.set_property("batched-push-timeout", MUXER_BATCH_TIMEOUT_USEC)

    # Create source element for reading
    for i in range(number_sources):
        uri_name = args[i]
        if uri_name.find("rtsp://") == 0:
            is_rtsp = True
        source_bin = create_source_bin(i, uri_name)
        if not source_bin:
            sys.stderr.write("\tUnable to create source bin \n")
        padname = "sink_%u" % i
        sinkpad = streammux.get_request_pad(padname)
        if not sinkpad:
            sys.stderr.write("\tUnable to create sink pad bin \n")
        srcpad = source_bin.get_static_pad("src")
        if not srcpad:
            sys.stderr.write("\tUnable to create src pad bin \n")

    # Use convertor to convert from NV12 to RGBA 
    nvvidconv_postosd = make_elm_or_print_err("nvvideoconvert", "convertor_postosd")

    # Create a caps filter
    caps = make_elm_or_print_err("capsfilter", "capsfilter")
    caps.set_property("caps", Gst.Caps.from_string("video/x-raw(memory:NVMM), format=I420"))

    # Make the encoder
    if codec == "H264":
        encoder = make_elm_or_print_err("nvv4l2h264enc", "h264-encoder")
    elif codec == "H265":
        encoder = make_elm_or_print_err("nvv4l2h265enc", "h265-encoder")
    encoder.set_property("bitrate", bitrate)
    if is_aarch64():
        encoder.set_property("preset-level", 1)
        encoder.set_property("insert-sps-pps", 1)
        #encoder.set_property("bufapi-version", 1)

    # Make the payload-encode video into RTP packets
    if codec == "H264":
        rtppay = make_elm_or_print_err("rtph264pay", "rtp-h264-payload")
    elif codec == "H265":
        rtppay = make_elm_or_print_err("rtph265pay", "rtp-h265-payload")

    # Make the UDP sink
    updsink_port_num = 5400
    sink = make_elm_or_print_err("udpsink", "udp-sink")
    sink.set_property("host", "")
    sink.set_property("port", updsink_port_num)
    sink.set_property("async", False)
    sink.set_property("sync", 1)
    sink.set_property("qos", 0)


    print("Adding elements to Pipeline")

    print("Linking elements in the Pipeline")

    # create an event loop and feed gstreamer bus messages to it
    loop = GLib.MainLoop()
    bus = pipeline.get_bus()
    # the "message" signal is only emitted once a signal watch is installed
    bus.add_signal_watch()
    bus.connect("message", bus_call, loop)

    # Start streaming
    rtsp_port_num = 8554

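    server = GstRtspServer.RTSPServer.new()
    server.props.service = "%d" % rtsp_port_num
    server.attach(None)

    # The factory re-serves the RTP packets that udpsink sends to updsink_port_num
    factory = GstRtspServer.RTSPMediaFactory.new()
    factory.set_launch(
        '( udpsrc name=pay0 port=%d buffer-size=524288 caps="application/x-rtp, media=video, clock-rate=90000, encoding-name=(string)%s, payload=96 " )'
        % (updsink_port_num, codec)
    )
    factory.set_shared(True)
    server.get_mount_points().add_factory("/ds-test", factory)

    print(
        "\n *** DeepStream: Launched RTSP Streaming at rtsp://localhost:%d/ds-test ***\n\n"
        % rtsp_port_num
    )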

    # Now with everything defined, we can start the playback and listen to the events.
    print(f"Starting pipeline t1 {time.time()}\n")
    pipeline.set_state(Gst.State.PLAYING)
    start_time = time.time()
    try:
        loop.run()
    except BaseException:
        pass
    print("Exiting app\n")
    # cleanup
    pipeline.set_state(Gst.State.NULL)
    print("--- %s seconds ---" % (time.time() - start_time))

# Parse the configuration arguments for running the script
def parse_args():
    parser = argparse.ArgumentParser(description='AI2 Main Inference Pipeline ')
    parser.add_argument("-i", "--input",
                  help="Path to input elementary stream", nargs="+", default=["a"], required=True)
    parser.add_argument("-g", "--gie", default="nvinfer",
                  help="Choose GPU inference engine type nvinfer or nvinferserver , default=nvinfer", choices=['nvinfer','nvinferserver'])
    parser.add_argument("-c", "--codec", default="H264",
                  help="RTSP Streaming Codec H264/H265 , default=H264", choices=['H264','H265'])
    parser.add_argument("-b", "--bitrate", default=4000000,
                  help="Set the encoding bitrate , default=4000000", type=int)
    # Check input arguments
    if len(sys.argv)==1:
        parser.print_help(sys.stderr)
        sys.exit(1)
    args = parser.parse_args()
    global codec
    global bitrate
    global stream_path
    global gie
    gie = args.gie
    codec = args.codec
    bitrate = args.bitrate
    stream_path = args.input
    return stream_path

if __name__ == '__main__':
    stream_path = parse_args()
    sys.exit(main(stream_path))

This is another problem, not the delay caused by

In fact, encoding, UDP multicast and gstreamer-rtspserver all add delay, which is a known issue. DeepStream is only responsible for decoding, data processing and inference.

For accurate delay measurements, it is best to use nv3dsink and display the frames directly on the screen.

So, for now, this is the reduced pipeline:
uridecodebin -> nvstreammux -> nv3dsink

If I raise my hand in front of the camera, I still need to wait about 6s (before it was 13s) to see my hand raised in the output stream, which is too much. And by the way, on the Jetson Nano dev kit it takes 15s.

I suspect uridecodebin.

uridecodebin is not the offender; it’s nvstreammux. It is buffering the source’s images. The solution is to set the live-source property to True, that is:
streammux.set_property("live-source", True)
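In the script above, the is_rtsp flag is computed while the sources are created but never used, so one way to wire in the fix is to set the property only when a live source is present. The sketch below assumes that RTSP URIs are the only live inputs; the helper names, the example URIs and the RecordingElement stand-in are hypothetical and only there so the snippet runs without GStreamer.

```python
class RecordingElement:
    """Stand-in for the nvstreammux Gst.Element so the sketch runs anywhere."""

    def __init__(self):
        self.props = {}

    def set_property(self, name, value):
        self.props[name] = value


def is_live_uri(uri):
    """Same check the script uses for its is_rtsp flag."""
    return uri.find("rtsp://") == 0


def configure_muxer_for_sources(streammux, uris):
    """Set live-source on the muxer when any input is an RTSP stream."""
    live = any(is_live_uri(uri) for uri in uris)
    if live:
        streammux.set_property("live-source", True)
    return live


# Hypothetical inputs: one live camera and one local file.
live_mux = RecordingElement()
configure_muxer_for_sources(live_mux, ["rtsp://camera.example/stream"])
file_mux = RecordingElement()
configure_muxer_for_sources(file_mux, ["file:///tmp/sample.mp4"])
print(live_mux.props, file_mux.props)
```

With a real pipeline you would pass the element created by make_elm_or_print_err("nvstreammux", "stream-muxer") and the list of input URIs.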

Solved, but why? The Gst-nvstreammux page of the DeepStream 6.4 documentation says:

Set the live-source property to true to inform the muxer that the sources are live. In this case the muxer attaches the PTS of the last copied input buffer to the batched Gst Buffer’s PTS. If the property is set to false, the muxer calculates timestamps based on the frame rate of the source which first negotiates capabilities with the muxer.
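To make the two rules in that quote concrete, here is a toy model of how the output timestamps would differ. It is only a reading of the quoted sentences for batch-size 1, not the muxer's actual code: the function names are invented, the input PTS values are made up, and starting the computed timestamps at zero is an assumption.

```python
def batch_pts_live(input_pts):
    """live-source=true: each batch reuses the PTS of the last copied input buffer.

    With batch-size 1 the last copied buffer is the only buffer in the batch.
    """
    return list(input_pts)


def batch_pts_computed(num_frames, fps, start=0.0):
    """live-source=false: timestamps derived from the negotiated frame rate.

    Assumption for illustration: counting starts at `start` and advances by 1/fps.
    """
    return [start + n / fps for n in range(num_frames)]


# Made-up arrival timestamps (seconds) of four frames from a live camera.
camera_pts = [13.00, 13.05, 13.08, 13.13]

print("live-source=true :", batch_pts_live(camera_pts))
print("live-source=false:", batch_pts_computed(len(camera_pts), fps=24))
```

In the first case the output timestamps follow the camera; in the second they follow an idealised frame-rate clock that is independent of when the frames actually arrived.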

What I do not understand is the reason for the existence of this buffer.

Decoding and inference run in parallel. Buffering is added so that a local stream file can be processed completely.

In fact, the new streammux deprecates this property; you can refer to this document.