Fastest way to render h264 stream?

I have a UDP video stream in h.264 format, and I need to display it on the screen with minimal delay on a desktop computer running Ubuntu. I used the following command:

gst-launch-1.0 -v udpsrc port=5600 ! application/x-rtp ! rtph264depay ! avdec_h264 skip-frame=5 ! autovideosink
Everything works fine with a delay of approximately 100 milliseconds.

I wanted to reduce the delay by using GPU acceleration on an RTX 3070. For this, I installed the DeepStream SDK and used the command:
gst-launch-1.0 -v udpsrc port=5600 ! application/x-rtp ! rtph264depay ! h264parse ! nvh264dec ! capsfilter caps="video/x-raw(memory:GLMemory)" ! glsinkbin sync=false
However, the delay doubled to around 200 milliseconds.

I also checked how the same stream is displayed on a Jetson Nano, using the command:
gst-launch-1.0 -v udpsrc port=5600 ! application/x-rtp, payload=96 ! rtph264depay ! h264parse ! avdec_h264 ! nvvidconv ! nvoverlaysink
This gave the lowest delay, around 60 milliseconds.

Why is the delay on the less powerful Jetson Nano more than 3 times lower than on the desktop with the RTX 3070, and is it possible to achieve such values on a desktop computer?

  1. Network latency cannot be avoided: UDP delivery is not guaranteed, even over localhost. Have you measured the network latency separately?
  2. An H.264 encoded stream can only start playing from an IDR frame.
  3. We don’t know how you measured the latency between your source and the display.

The UDP video source was an IP camera, the same one with the same settings for all measurements. All measurements were made within the local network. Measurement method: a screen capture of the live video, taken with the same camera that is the source of the video. The measured latency therefore includes everything, network delays included. But since the lowest latency, on the Jetson Nano, was 60 milliseconds, the network latency should be in the region of 30 milliseconds, and certainly not more than 60 ms.

Please use a network tool to measure the UDP packet latency. And please pay attention to point 2 that I mentioned: the H.264 encoded stream can only be played from an IDR frame, so you need to use a correct measurement method.

What does my question have to do with network latency? I am sure the problem is not in the network. I don’t care about latency from the network. I’m interested in the latency from decoding H.264 data after the data has already been received. I’m interested in why avdec_h264 works better than nvh264dec on the same machine under the same conditions, and whether it is possible to do something about it. Do you mean that these two decoders handle IDR frames so differently that it would explain the difference in latency?

In the parameter description
gst-inspect-1.0 nvh264dec
I don’t see anything that could affect the processing of IDR frame.

My guess is that the frame is copied too many times between GPU memory and CPU memory, and goes through some extra processing before getting to the screen. But I don’t know how to shorten this path.

From your description, what you measured is not the decoding time.

Certainly. What I measured includes absolutely everything: the network delay, the decoding delay, the delay in displaying the frame on the screen, and any whims of the operating system along the way. It even includes the entire encoding and transmission pipeline on the IP camera side, but that part is no more than 40 ms.

An easy way is to measure each element’s input and output times with pad probe functions. This may help you identify the real latency of every element.
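
A minimal sketch of such a probe-based measurement, assuming buffers keep the same PTS between the two pads being compared (this may not hold across every element). `LatencyTracker` and `attach` are hypothetical names, not part of GStreamer; the tracker itself is plain Python, and only `attach` needs PyGObject and GStreamer installed.

```python
import time


class LatencyTracker:
    """Match buffers seen at an upstream pad and a downstream pad by timestamp."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.entered = {}   # pts -> time the buffer passed the upstream pad
        self.samples = []   # measured latencies in milliseconds

    def on_enter(self, pts):
        # Keep only the first sighting of a given timestamp.
        self.entered.setdefault(pts, self.clock())

    def on_exit(self, pts):
        # Ignore buffers that were never seen at the upstream pad.
        start = self.entered.pop(pts, None)
        if start is not None:
            self.samples.append((self.clock() - start) * 1000.0)

    def mean_ms(self):
        return sum(self.samples) / len(self.samples) if self.samples else None


def attach(tracker, upstream_pad, downstream_pad):
    """Hook the tracker to two Gst.Pad objects (requires PyGObject and GStreamer)."""
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    def enter(pad, info):
        tracker.on_enter(info.get_buffer().pts)
        return Gst.PadProbeReturn.OK

    def leave(pad, info):
        tracker.on_exit(info.get_buffer().pts)
        return Gst.PadProbeReturn.OK

    upstream_pad.add_probe(Gst.PadProbeType.BUFFER, enter)
    downstream_pad.add_probe(Gst.PadProbeType.BUFFER, leave)
```

For example, the decoder's sink pad and the video sink's sink pad (both obtained with `get_static_pad("sink")`) could be passed to `attach` to estimate the time a frame spends between entering the decoder and reaching the sink. Timestamps that never show up at the downstream pad stay in `entered`, so a long-running measurement would need to prune old entries.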

The NVIDIA hardware-accelerated H.264 video decoder is nvv4l2decoder. Please refer to the Gst-nvvideo4linux2 page in the DeepStream 6.4 documentation.

You can also use “gst-inspect-1.0 nvv4l2decoder” to get the related information.

With RTX3070, the display sink should be “nveglglessink”. So the pipeline should be
gst-launch-1.0 -v udpsrc port=5600 ! application/x-rtp ! rtph264depay ! h264parse ! nvv4l2decoder ! nveglglessink sync=false

This pipeline has a latency of about 200-230 ms. When I added ‘nvv4l2decoder low-latency-mode=true’ the result decreased to 100 ms, which is the same as avdec_h264. I am reading about probe functions to measure each part of the pipeline separately. It takes some time…

So, I wrote a simple Python script to measure the pipeline:

import gi
import time
gi.require_version('Gst', '1.0')
from gi.repository import Gst, GLib

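class TimeMeasure:
    # Name of the sink element. It must match the sink of the pipeline under
    # test (e.g. "autovideosink0" for the avdec_h264 pipeline).
    render_name = "eglglessink0"

    def __init__(self):
        self.start_time = time.perf_counter()        # time of the previous probe hit
        self.full_frame_start = time.perf_counter()  # time the previous frame passed the sink
        self.name = "none"                           # element whose probe fired last

    def make(self, element_name):
        # Called from every pad probe. The time since the previous probe hit is
        # attributed to the element whose probe fired last, so each value is an
        # interval between consecutive probe events.
        now = time.perf_counter()
        diff = now - self.start_time
        if self.name == TimeMeasure.render_name:
            # "full time" is the interval between consecutive frames passing
            # the sink, i.e. the frame period rather than end-to-end latency.
            diff2 = now - self.full_frame_start
            print(f"{self.name} time: {diff*1000:.3f} ms, full time: {diff2*1000:.3f}\n")
            self.full_frame_start = now
        else:
            print(f"{self.name} time: {diff * 1000:.3f} ms")
        self.start_time = now
        self.name = element_name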

def probe_time(pad, info, user_data):
    measure, name = user_data
    measure.make(name)
    return Gst.PadProbeReturn.OK

def main():
    Gst.init(None)
    loop = GLib.MainLoop()

    time_measure = TimeMeasure()

    pipeline_string = "udpsrc port=5600 ! application/x-rtp ! rtph264depay ! h264parse ! nvv4l2decoder low-latency-mode=true ! nveglglessink sync=false max-lateness=10000"
    # pipeline_string = "udpsrc port=5600 ! application/x-rtp ! rtph264depay ! h264parse ! avdec_h264 ! videoconvert ! autovideosink"
    pipeline = Gst.parse_launch(pipeline_string)
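    # Attach one buffer probe per element: on its sink pad, or on its src pad
    # for source elements such as udpsrc, which have no sink pad.
    for element in pipeline.children:
        element_name = element.get_name()
        pad = element.get_static_pad("sink")
        if pad is None:
            pad = element.get_static_pad("src")
        pad.add_probe(Gst.PadProbeType.BUFFER, probe_time, [time_measure, element_name])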

    pipeline.set_state(Gst.State.PLAYING)

    try:
        loop.run()
    except KeyboardInterrupt:
        pass
    finally:
        pipeline.set_state(Gst.State.NULL)

if __name__ == "__main__":
    main()

And I got the following timings:
udpsrc port=5600 ! application/x-rtp ! rtph264depay ! h264parse ! nvv4l2decoder low-latency-mode=true ! nveglglessink sync=false max-lateness=10000

udpsrc0 time: 0.030 ms
capsfilter0 time: 0.017 ms
rtph264depay0 time: 0.756 ms
udpsrc0 time: 0.017 ms
capsfilter0 time: 0.012 ms
rtph264depay0 time: 0.593 ms
udpsrc0 time: 0.009 ms
capsfilter0 time: 0.007 ms
rtph264depay0 time: 0.019 ms
h264parse0 time: 0.030 ms
nvv4l2decoder0 time: 1.021 ms
eglglessink0 time: 17.537 ms, full time: 20.048

udpsrc0 time: 0.069 ms
capsfilter0 time: 0.045 ms
rtph264depay0 time: 0.628 ms
udpsrc0 time: 0.031 ms
capsfilter0 time: 0.020 ms
rtph264depay0 time: 0.538 ms
udpsrc0 time: 0.014 ms
capsfilter0 time: 0.012 ms
rtph264depay0 time: 0.032 ms
h264parse0 time: 0.035 ms
nvv4l2decoder0 time: 1.789 ms
eglglessink0 time: 16.677 ms, full time: 19.891

udpsrc0 time: 0.038 ms
capsfilter0 time: 0.018 ms
rtph264depay0 time: 0.754 ms
udpsrc0 time: 0.018 ms
capsfilter0 time: 0.011 ms
rtph264depay0 time: 0.513 ms
udpsrc0 time: 0.010 ms
capsfilter0 time: 0.007 ms
rtph264depay0 time: 0.020 ms
h264parse0 time: 0.031 ms
nvv4l2decoder0 time: 1.022 ms
eglglessink0 time: 19.916 ms, full time: 22.358

udpsrc0 time: 0.031 ms
capsfilter0 time: 0.017 ms
rtph264depay0 time: 0.753 ms
udpsrc0 time: 0.012 ms
capsfilter0 time: 0.009 ms
rtph264depay0 time: 0.594 ms
udpsrc0 time: 0.009 ms
capsfilter0 time: 0.007 ms
rtph264depay0 time: 0.020 ms
h264parse0 time: 0.031 ms
nvv4l2decoder0 time: 1.869 ms
eglglessink0 time: 14.774 ms, full time: 18.128

udpsrc0 time: 0.037 ms
capsfilter0 time: 0.018 ms
rtph264depay0 time: 0.744 ms
udpsrc0 time: 0.017 ms
capsfilter0 time: 0.012 ms
rtph264depay0 time: 0.602 ms
udpsrc0 time: 0.010 ms
capsfilter0 time: 0.008 ms
rtph264depay0 time: 0.020 ms
h264parse0 time: 0.031 ms
nvv4l2decoder0 time: 0.966 ms
eglglessink0 time: 17.170 ms, full time: 19.635

udpsrc0 time: 0.029 ms
capsfilter0 time: 0.017 ms
rtph264depay0 time: 0.750 ms
udpsrc0 time: 0.014 ms
capsfilter0 time: 0.010 ms
rtph264depay0 time: 0.618 ms
udpsrc0 time: 0.011 ms
capsfilter0 time: 0.009 ms
rtph264depay0 time: 0.030 ms
h264parse0 time: 0.034 ms
nvv4l2decoder0 time: 0.993 ms
eglglessink0 time: 18.898 ms, full time: 21.415

udpsrc port=5600 ! application/x-rtp ! rtph264depay ! h264parse ! avdec_h264 ! videoconvert ! autovideosink

udpsrc0 time: 0.027 ms
capsfilter0 time: 0.013 ms
rtph264depay0 time: 0.766 ms
udpsrc0 time: 0.014 ms
capsfilter0 time: 0.008 ms
rtph264depay0 time: 0.670 ms
udpsrc0 time: 0.012 ms
capsfilter0 time: 0.008 ms
rtph264depay0 time: 0.021 ms
h264parse0 time: 0.023 ms
avdec_h264-0 time: 19.234 ms
udpsrc0 time: 0.036 ms
capsfilter0 time: 0.015 ms
rtph264depay0 time: 0.744 ms
udpsrc0 time: 0.013 ms
capsfilter0 time: 0.008 ms
rtph264depay0 time: 0.600 ms
udpsrc0 time: 0.013 ms
capsfilter0 time: 0.008 ms
rtph264depay0 time: 0.021 ms
h264parse0 time: 0.024 ms
avdec_h264-0 time: 4.805 ms
videoconvert0 time: 0.032 ms
autovideosink0 time: 12.437 ms, full time: 39.550

udpsrc0 time: 0.039 ms
capsfilter0 time: 0.025 ms
rtph264depay0 time: 0.721 ms
udpsrc0 time: 0.018 ms
capsfilter0 time: 0.012 ms
rtph264depay0 time: 0.611 ms
udpsrc0 time: 0.017 ms
capsfilter0 time: 0.009 ms
rtph264depay0 time: 0.021 ms
h264parse0 time: 0.024 ms
avdec_h264-0 time: 18.365 ms
udpsrc0 time: 0.042 ms
capsfilter0 time: 0.019 ms
rtph264depay0 time: 0.722 ms
udpsrc0 time: 0.013 ms
capsfilter0 time: 0.009 ms
rtph264depay0 time: 0.448 ms
udpsrc0 time: 0.010 ms
capsfilter0 time: 0.008 ms
rtph264depay0 time: 0.021 ms
h264parse0 time: 0.025 ms
avdec_h264-0 time: 5.935 ms
videoconvert0 time: 0.041 ms
autovideosink0 time: 13.408 ms, full time: 40.562

udpsrc0 time: 0.036 ms
capsfilter0 time: 0.020 ms
rtph264depay0 time: 0.734 ms
udpsrc0 time: 0.016 ms
capsfilter0 time: 0.010 ms
rtph264depay0 time: 0.641 ms
udpsrc0 time: 0.010 ms
capsfilter0 time: 0.008 ms
rtph264depay0 time: 0.021 ms
h264parse0 time: 0.023 ms
avdec_h264-0 time: 17.155 ms
udpsrc0 time: 0.040 ms
capsfilter0 time: 0.017 ms
rtph264depay0 time: 0.721 ms
udpsrc0 time: 0.012 ms
capsfilter0 time: 0.009 ms
rtph264depay0 time: 0.735 ms
udpsrc0 time: 0.023 ms
capsfilter0 time: 0.014 ms
rtph264depay0 time: 0.034 ms
h264parse0 time: 0.036 ms
avdec_h264-0 time: 6.248 ms
videoconvert0 time: 0.039 ms
autovideosink0 time: 11.642 ms, full time: 38.244

udpsrc0 time: 0.033 ms
capsfilter0 time: 0.016 ms
rtph264depay0 time: 0.770 ms
udpsrc0 time: 0.029 ms
capsfilter0 time: 0.017 ms
rtph264depay0 time: 0.347 ms
udpsrc0 time: 0.016 ms
capsfilter0 time: 0.013 ms
rtph264depay0 time: 0.031 ms
h264parse0 time: 0.038 ms
avdec_h264-0 time: 19.574 ms
udpsrc0 time: 0.038 ms
capsfilter0 time: 0.024 ms
rtph264depay0 time: 0.723 ms
udpsrc0 time: 0.013 ms
capsfilter0 time: 0.008 ms
rtph264depay0 time: 0.383 ms
udpsrc0 time: 0.012 ms
capsfilter0 time: 0.008 ms
rtph264depay0 time: 0.021 ms
h264parse0 time: 0.023 ms
avdec_h264-0 time: 4.833 ms
videoconvert0 time: 0.032 ms
autovideosink0 time: 12.463 ms, full time: 39.467

udpsrc0 time: 0.033 ms
capsfilter0 time: 0.014 ms
rtph264depay0 time: 0.745 ms
udpsrc0 time: 0.012 ms
capsfilter0 time: 0.010 ms
rtph264depay0 time: 0.352 ms
udpsrc0 time: 0.013 ms
capsfilter0 time: 0.008 ms
rtph264depay0 time: 0.022 ms
h264parse0 time: 0.022 ms
avdec_h264-0 time: 18.559 ms
udpsrc0 time: 0.034 ms
capsfilter0 time: 0.013 ms
rtph264depay0 time: 0.753 ms
udpsrc0 time: 0.013 ms
capsfilter0 time: 0.008 ms
rtph264depay0 time: 0.418 ms
udpsrc0 time: 0.014 ms
capsfilter0 time: 0.008 ms
rtph264depay0 time: 0.021 ms
h264parse0 time: 0.023 ms
avdec_h264-0 time: 4.812 ms
videoconvert0 time: 0.027 ms
autovideosink0 time: 14.682 ms, full time: 40.616
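
To make logs like these easier to compare, the per-element intervals can be averaged with a short parser. This is a minimal sketch, assuming the exact line format printed by the script above; `summarize` is a hypothetical helper name, and the sample text is a small excerpt of the first log.

```python
import re
from collections import defaultdict

# Matches "name time: 1.234 ms" with an optional ", full time: 5.678" suffix.
LINE = re.compile(r"^(\S+) time: ([\d.]+) ms(?:, full time: ([\d.]+))?")


def summarize(log_text):
    """Return ({element: mean interval in ms}, mean 'full time' in ms or None)."""
    per_element = defaultdict(list)
    full = []
    for line in log_text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip blank lines and pipeline descriptions
        per_element[m.group(1)].append(float(m.group(2)))
        if m.group(3) is not None:
            full.append(float(m.group(3)))
    means = {name: sum(v) / len(v) for name, v in per_element.items()}
    return means, (sum(full) / len(full) if full else None)


sample = """\
h264parse0 time: 0.030 ms
nvv4l2decoder0 time: 1.021 ms
eglglessink0 time: 17.537 ms, full time: 20.048
h264parse0 time: 0.035 ms
nvv4l2decoder0 time: 1.789 ms
eglglessink0 time: 16.677 ms, full time: 19.891
"""
means, full_mean = summarize(sample)
for name, value in means.items():
    print(f"{name}: {value:.3f} ms")
print(f"mean full time: {full_mean:.3f} ms")
```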

Also, I measured network latency:

ping -s 20000 -c 8 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 20000(20028) bytes of data.
20008 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=3.52 ms
20008 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=3.51 ms
20008 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=3.54 ms
20008 bytes from 192.168.1.2: icmp_seq=4 ttl=64 time=3.51 ms
20008 bytes from 192.168.1.2: icmp_seq=5 ttl=64 time=3.49 ms
20008 bytes from 192.168.1.2: icmp_seq=6 ttl=64 time=3.50 ms
20008 bytes from 192.168.1.2: icmp_seq=7 ttl=64 time=3.50 ms
20008 bytes from 192.168.1.2: icmp_seq=8 ttl=64 time=3.49 ms
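
The round-trip times above can be summarized with a few lines of Python. This is a minimal sketch; `rtt_stats` is a hypothetical helper name, and halving the round trip to estimate the one-way delay assumes a symmetric path, which ping itself does not verify.

```python
import re


def rtt_stats(ping_output):
    """Return (min, mean, max) round-trip time in ms parsed from ping output."""
    rtts = [float(x) for x in re.findall(r"time=([\d.]+) ms", ping_output)]
    if not rtts:
        raise ValueError("no RTT samples found")
    return min(rtts), sum(rtts) / len(rtts), max(rtts)


sample = """\
20008 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=3.52 ms
20008 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=3.51 ms
20008 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=3.54 ms
20008 bytes from 192.168.1.2: icmp_seq=4 ttl=64 time=3.51 ms
20008 bytes from 192.168.1.2: icmp_seq=5 ttl=64 time=3.49 ms
20008 bytes from 192.168.1.2: icmp_seq=6 ttl=64 time=3.50 ms
20008 bytes from 192.168.1.2: icmp_seq=7 ttl=64 time=3.50 ms
20008 bytes from 192.168.1.2: icmp_seq=8 ttl=64 time=3.49 ms
"""
lo, mean, hi = rtt_stats(sample)
print(f"RTT min/mean/max: {lo:.2f}/{mean:.2f}/{hi:.2f} ms")
# Assumption: the path is symmetric, so one-way delay is about half the RTT.
print(f"estimated one-way delay: {mean / 2:.2f} ms")
```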

And the results are unclear to me. My IP camera source runs at 60 fps, and the monitor has a 60 Hz refresh rate. So why is the full interval between frames around 20 ms and 40 ms, rather than around 1/60 s? Additionally, why is the full pipeline latency (IP camera encode plus desktop decode and display) the same for both variants, at around 90-120 ms?
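
For reference, the conversion between frame interval and frame rate is simple arithmetic. A small sketch using the approximate "full time" values from the logs above (20 ms and 40 ms); the function names are hypothetical.

```python
def fps_from_interval_ms(interval_ms):
    """Convert a frame-to-frame interval in milliseconds to frames per second."""
    return 1000.0 / interval_ms


def interval_ms_from_fps(fps):
    """Convert a frame rate to the expected frame-to-frame interval in ms."""
    return 1000.0 / fps


print(f"60 fps -> {interval_ms_from_fps(60):.2f} ms per frame")
for interval in (20.0, 40.0):  # approximate 'full time' values from the logs
    print(f"{interval:.0f} ms per frame -> {fps_from_interval_ms(interval):.0f} fps")
```

An interval of about 20 ms corresponds to roughly 50 frames per second, and about 40 ms to roughly 25, both below the nominal 60.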

I would really like to achieve a total latency of no more than 60 ms on a device that can decode a frame in 1 ms :)

What is your device (which GPU)? What is the CPU load when you run the UDP pipeline?

Let’s make things easier. Please first test the decoding latency with a local video file (raw H.264 data) instead of the UDP stream.

I mentioned that my device is an RTX 3070. Additionally, I’ve reinstalled the driver and can provide the software versions:

  • Kernel 6.5.0-15-generic #15~22.04.1-Ubuntu SMP x86_64
  • NVIDIA Version: 545.23.08
  • CUDA Version: 12.3
  • deepstream-6.4_6.4.0-1_amd64

CPU load:

I don’t understand what you mean by latency in the context of a video file since it is not tied to the real world. In the context of a video file, I can only talk about frame processing time, and everything looks clear.

filesrc location=/opt/nvidia/deepstream/deepstream-6.4/samples/streams/sample_720p.h264 ! h264parse ! nvv4l2decoder ! nveglglessink sync=false

The only caveat was that I had to remove the parameter ‘low-latency-mode=true’, as it caused a strange effect: frames were displayed out of order over spans of a fraction of a second.

filesrc0 time: 0.021 ms
h264parse0 time: 0.036 ms
nvv4l2decoder0 time: 14.957 ms
eglglessink0 time: 0.961 ms, full time: 15.975

nvv4l2decoder0 time: 15.557 ms
eglglessink0 time: 0.986 ms, full time: 16.543

nvv4l2decoder0 time: 15.707 ms
eglglessink0 time: 0.977 ms, full time: 16.685

nvv4l2decoder0 time: 15.627 ms
eglglessink0 time: 0.978 ms, full time: 16.605

nvv4l2decoder0 time: 15.734 ms
eglglessink0 time: 0.982 ms, full time: 16.716

nvv4l2decoder0 time: 15.827 ms
eglglessink0 time: 1.035 ms, full time: 16.862

nvv4l2decoder0 time: 15.591 ms
eglglessink0 time: 0.997 ms, full time: 16.588

nvv4l2decoder0 time: 15.763 ms
eglglessink0 time: 1.068 ms, full time: 16.831

filesrc0 time: 0.022 ms
h264parse0 time: 0.058 ms
nvv4l2decoder0 time: 15.326 ms
eglglessink0 time: 0.978 ms, full time: 16.384

nvv4l2decoder0 time: 15.707 ms
eglglessink0 time: 0.991 ms, full time: 16.698

nvv4l2decoder0 time: 15.713 ms
eglglessink0 time: 0.972 ms, full time: 16.685

nvv4l2decoder0 time: 15.782 ms
eglglessink0 time: 1.011 ms, full time: 16.793

nvv4l2decoder0 time: 15.671 ms
eglglessink0 time: 1.168 ms, full time: 16.839

nvv4l2decoder0 time: 15.388 ms
eglglessink0 time: 1.114 ms, full time: 16.502

No. For compressed video, the latency depends on a series of frames, not just on single frames. The decoding time fluctuates from frame to frame, and the decoding of the current frame may be influenced by the previous frames, so what we talk about is the average decoding latency. When the compressed video is transferred over Ethernet, the UDP packets reach the destination in arbitrary order and with different latencies. Each frame is assembled from pieces carried in many different UDP packets, so it takes extra time to wait for those packets and reconstruct the frame from them. That is why we don’t want to count the Ethernet transfer and protocol time in the overall latency. It is unpredictable.
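
To illustrate the reassembly point: a frame split across several packets is complete only when its last packet arrives, so the wait is set by the slowest packet. A minimal sketch with made-up arrival times (all values are hypothetical, not measured); `frame_ready_times` is a hypothetical helper name.

```python
from collections import defaultdict


def frame_ready_times(packets):
    """Map frame_id -> (first arrival, last arrival, wait), all in ms.

    `packets` is an iterable of (frame_id, arrival_ms) pairs in any order.
    """
    arrivals = defaultdict(list)
    for frame_id, arrival_ms in packets:
        arrivals[frame_id].append(arrival_ms)
    return {f: (min(t), max(t), max(t) - min(t)) for f, t in arrivals.items()}


# Hypothetical arrivals: three packets per frame, slightly interleaved.
packets = [(0, 0.0), (0, 0.8), (1, 16.9), (0, 1.4), (1, 17.5), (1, 18.1)]
for frame_id, (first, last, wait) in sorted(frame_ready_times(packets).items()):
    print(f"frame {frame_id}: first {first:.1f} ms, complete {last:.1f} ms, wait {wait:.1f} ms")
```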

From the data you posted here, the decoding latency is around 15 to 16 ms. That is OK.

I get the impression that you are simply ignoring what I am saying and promoting your opinion that this is all network delay and it won’t get any better. If I believe you, then I have to accept that on the RTX 3070 I will not achieve a better result than 100 ms, despite the fact that I wrote that the Jetson Nano shows 60 ms. I also managed to test a $15 board from AliExpress that also shows 60 ms. Well, I don’t see any point in continuing this fruitless discussion, considering that you don’t owe me anything :) And to be honest, I think that you are not competent enough to solve it.