'tee-queue' bug when used with nvstreammux

• Hardware Platform (Jetson / GPU): GeForce RTX 2080 Ti
• DeepStream Version: 6.0
• TensorRT Version: 8.0.1.6+cuda11.3.1.005
• NVIDIA GPU Driver Version (valid for GPU only): 470.103.01
• Issue Type (questions, new requirements, bugs): bugs

Hi all,

I’m currently working on a simple camera image-capture project.

I used the ‘tee-queue’ pattern but found some weird behavior, especially when DeepStream elements are involved.

Here is my first test. It is a plain GStreamer pipeline.

# TEST CASE 1:
$ gst-launch-1.0 \
    videotestsrc pattern=18 is-live=1 ! 'video/x-raw, width=1920, height=1080' ! \
    tee name=t \
    t. ! queue ! videoconvert ! nveglglessink sync=0 \
    t. ! queue max-size-buffers=0 max-size-bytes=0 max-size-time=0 ! videoconvert ! nveglglessink name=sink0 sync=0

I also wrote a gst-python probe function and hooked it onto the ‘sink’ pad of sink0 in the 2nd tee-branch. As you can see, the probe function is very simple: it just sleeps for a random number of milliseconds to simulate some processing cost during capture.

import random

import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst, GLib

def capture_probe_func(pad, info):
    delay = random.randint(0, 2)
    GLib.usleep(delay * 200000)  # 200 ms x delay, to simulate some processing cost
    return Gst.PadProbeReturn.OK
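
For reference, the probe is attached roughly like this (a minimal sketch; the pipeline string is just the TEST CASE 1 launch line):

Gst.init(None)

pipeline = Gst.parse_launch(
    "videotestsrc pattern=18 is-live=1 ! video/x-raw,width=1920,height=1080 ! "
    "tee name=t "
    "t. ! queue ! videoconvert ! nveglglessink sync=0 "
    "t. ! queue max-size-buffers=0 max-size-bytes=0 max-size-time=0 ! "
    "videoconvert ! nveglglessink name=sink0 sync=0"
)

# Hook the 'sink' pad of sink0 in the 2nd tee-branch.
sink0 = pipeline.get_by_name("sink0")
sink0.get_static_pad("sink").add_probe(Gst.PadProbeType.BUFFER, capture_probe_func)

pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()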

The result is as good as expected. Notice that the random delays in the 2nd tee-branch never block the 1st tee-branch, because the branches run in separate threads (thanks to the buffering queues). The display of the 1st tee-branch is very smooth.

Now here is my 2nd test, with DeepStream. Things went BAD.

# TEST CASE 2:
$ gst-launch-1.0 \
    videotestsrc pattern=18 is-live=1 ! 'video/x-raw, width=1920, height=1080, framerate=25/1' ! nvvideoconvert ! 'video/x-raw(memory:NVMM)' ! queue ! m.sink_0 \
    videotestsrc pattern=18 is-live=1 ! 'video/x-raw, width=1920, height=1080, framerate=25/1' ! nvvideoconvert ! 'video/x-raw(memory:NVMM)' ! queue ! m.sink_1 \
    videotestsrc pattern=18 is-live=1 ! 'video/x-raw, width=1920, height=1080, framerate=25/1' ! nvvideoconvert ! 'video/x-raw(memory:NVMM)' ! queue ! m.sink_2 \
    videotestsrc pattern=18 is-live=1 ! 'video/x-raw, width=1920, height=1080, framerate=25/1' ! nvvideoconvert ! 'video/x-raw(memory:NVMM)' ! queue ! m.sink_3 \
    nvstreammux name=m batch-size=4 width=1920 height=1080 live-source=1 batched-push-timeout=40000 sync-inputs=0 attach-sys-ts=0 nvbuf-memory-type=0 ! \
    tee name=t \
    t. ! queue ! nvmultistreamtiler rows=2 columns=2 ! nveglglessink sync=0 \
    t. ! queue max-size-buffers=0 max-size-bytes=0 max-size-time=0 ! fakesink name=sink0 sync=0

I used the same probe as above to hook the ‘sink’ pad of sink0 in the 2nd tee-branch.

But this time the display of the 1st branch choked frequently. The performance of the 1st branch was seriously affected by the random sleeps in the 2nd tee-branch. The queue in the 2nd tee-branch was totally USELESS for decoupling the two branches.
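
A quick way to check whether that queue is actually buffering anything is to watch its fill level (a small sketch, assuming the pipeline is built with Gst.parse_launch as in the snippet above and that the 2nd-branch queue is given name=q2 in the launch line):

def report_queue_level(q2):
    # 'current-level-buffers' is a standard property of the queue element.
    print("q2 buffers:", q2.get_property("current-level-buffers"))
    return True  # keep the periodic timeout running

q2 = pipeline.get_by_name("q2")
GLib.timeout_add(1000, report_queue_level, q2)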

WHY did this happen? All I want is:

  1. An ‘independent enough’ behavior of the 2nd tee-branch that does not block the 1st branch when nvstreammux is involved, which means supporting the buffering of batched frames in the ‘queue’ element.

  2. I also want my probe func to capture EVERY frame AT ITS OWN PACE (assuming I can afford a lot of queue buffers in the 2nd branch), instead of DISCARDING frames to keep up with the 1st branch by using `queue leaky=2 max-size-buffers=1` in the 2nd tee-branch.

  3. Behavior consistency between plain GStreamer & DeepStream for the tee-queue pattern.

Could anyone be kind enough to explain this to me?

Thanks in advance,

Ng.

nvstreammux uses a buffer pool to reuse its video buffers, so it will get stuck if any downstream plugin holds on to those buffers.
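
In other words, the slack available to the 2nd branch is bounded by the size of that pool. As a rough illustration (assuming the pipeline is driven from Python and that your nvstreammux version exposes the buffer-pool-size property), enlarging the pool only postpones the stall, it does not remove it:

# Must be set before the pipeline starts, i.e. before buffers are allocated.
mux = pipeline.get_by_name("m")
mux.set_property("buffer-pool-size", 16)  # more pool buffers = more slack, but still bounded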


Thanks for your kind reply!

I understand that GPU memory is precious, and the design decision of refcounting and pool blocking may be reasonable. But in many scenarios we need an ‘async message queue’-like behavior that does not lose frames (for example, high-fidelity frame capture), while at the same time keeping the display rendering smooth.

What I suggest is a mechanism that allows us to convert the GPU frame into a CPU frame (which could then be queued in much cheaper memory), while keeping the inference metadata attached to the converted frame, for the convenience of later analysis. Just for your consideration :)
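
Roughly what I have in mind, approximated with today's Python bindings (a rough sketch only, replacing the sleeping probe; it assumes RGBA, CPU-mappable NVMM surfaces upstream and the standard pyds bindings, and helper names like capture_worker are just illustrative): the probe copies each frame into system memory, hands it to a worker thread, and returns immediately so the batched NVMM buffer goes straight back to the pool.

import queue
import threading

import numpy as np
import pyds
from gi.repository import Gst

# CPU-side queue, decoupled from the NVMM buffer pool.
# Unbounded here; bound it if memory is a concern.
capture_queue = queue.Queue()

def capture_probe_func(pad, info):
    gst_buffer = info.get_buffer()
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        # Copy the GPU frame into ordinary system memory
        # (requires RGBA, CPU-mappable NVMM surfaces).
        surface = pyds.get_nvds_buf_surface(hash(gst_buffer), frame_meta.batch_id)
        cpu_frame = np.array(surface, copy=True, order='C')
        capture_queue.put((frame_meta.source_id, frame_meta.frame_num, cpu_frame))
        try:
            l_frame = l_frame.next
        except StopIteration:
            break
    # Return immediately: the batched NVMM buffer is released back to the
    # pool, so the 1st tee-branch keeps rendering smoothly.
    return Gst.PadProbeReturn.OK

def capture_worker():
    # The slow capture work runs here, at its own pace, on cheap CPU memory.
    while True:
        source_id, frame_num, cpu_frame = capture_queue.get()
        # ... save to disk, encode, analyze, etc.

threading.Thread(target=capture_worker, daemon=True).start()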

