DS 5.0.1 nvstreammux batch-size bug?

• GPU: RTX 2080 Ti
• DeepStream 5.0.1
• NVIDIA-SMI 450.57, Driver Version: 450.57, CUDA Version: 11.0

Using an entirely stock DS 5.0.1 container I am able to reliably produce a pipeline hang bug involving the nvstreammux batch-size parameter:

$ docker run -it --gpus all --ipc=host --rm nvcr.io/nvidia/deepstream:5.0.1-20.09-triton
<container starts normally>
# gst-launch-1.0 filesrc location=/opt/nvidia/deepstream/deepstream-5.0/samples/streams/sample_1080p_h264.mp4 ! decodebin ! mux.sink_0 nvstreammux name=mux batch-size=1 width=1920 height=1080 batched-push-timeout=100 ! fakesink
<output normal, stream terminates fine>
# gst-launch-1.0 filesrc location=/opt/nvidia/deepstream/deepstream-5.0/samples/streams/sample_1080p_h264.mp4 ! decodebin ! mux.sink_0 nvstreammux name=mux batch-size=4 width=1920 height=1080 batched-push-timeout=100 ! fakesink
<output normal, stream terminates fine>
# gst-launch-1.0 filesrc location=/opt/nvidia/deepstream/deepstream-5.0/samples/streams/sample_1080p_h264.mp4 ! decodebin ! mux.sink_0 nvstreammux name=mux batch-size=4 width=1920 height=1080 batched-push-timeout=1000000 ! fakesink
<pipeline hangs after nvstreammux sends the first batch downstream>
# gst-launch-1.0 filesrc location=/opt/nvidia/deepstream/deepstream-5.0/samples/streams/sample_1080p_h264.mp4 ! decodebin ! mux.sink_0 nvstreammux name=mux batch-size=2 width=1920 height=1080 batched-push-timeout=1000000 ! fakesink
<output normal, stream terminates fine>

The issue seems to occur whenever the combination of batch-size and batched-push-timeout results in more than 2 frames being put into the batch.
This issue occurs in 5.0.0 also.
Using qtdemux ! h264parse ! nvv4l2decoder as a replacement for decodebin changes nothing.

Same bug exists in nvcr.io/nvidia/deepstream:4.0.2-19.12-devel

# gst-launch-1.0 filesrc location=/root/deepstream_sdk_v4.0.2_x86_64/samples/streams/sample_1080p_h264.mp4 ! decodebin ! mux.sink_0 nvstreammux name=mux batch-size=4 width=1920 height=1080 batched-push-timeout=1000000 ! fakesink
<hangs>
# gst-launch-1.0 filesrc location=/root/deepstream_sdk_v4.0.2_x86_64/samples/streams/sample_1080p_h264.mp4 ! decodebin ! mux.sink_0 nvstreammux name=mux batch-size=2 width=1920 height=1080 batched-push-timeout=1000000 ! fakesink
<fine>

In the FAQ document (https://docs.nvidia.com/metropolis/deepstream/dev-guide/index.html#page/DeepStream_Faq_2019/deepstream_plugin_faq.html#) we have mentioned:

What is the difference between batch-size of nvstreammux and nvinfer? What are the recommended values for nvstreammux’ batch-size?

nvstreammux’ batch-size is the number of buffers (frames) it will batch together into one muxed buffer. nvinfer’s batch-size is the number of frames (primary mode) or objects (secondary mode) it will infer on together.

We recommend that nvstreammux’ batch-size be set to either the number of sources linked to it or the primary nvinfer’s batch-size.

The same hanging behaviour happens when the above pipelines have a downstream nvinfer element with a matching batch-size. I removed the nvinfer in order to simplify the example and so you could be sure the downstream elements are not blocking.

For instance, this pipeline will also hang:
filesrc -> decode -> nvstreammux batch-size=4 -> nvinfer batch-size=4 -> fakesink

What will happen if you use larger batched-push-timeout value?

The hang is triggered by the combination of batch-size and batched-push-timeout: the element hangs whenever more than 2 frames would be put into the batch.

I have tested batch-size=4, =8, =16 and all are fine so long as the batched-push-timeout is short enough that no more than 2 frames can be put into the buffer.

Another part of this behaviour I’ve identified which must have contributed to it not being detected sooner: the limit is not batch-size=2 frames, the limit is batch-size=2*num_inputs frames. So, if your nvstreammux has 4 filesrc -> decode -> queue inputs then nvstreammux will work nicely with batch-size=8 and hang with batch-size=9.
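The empirical threshold described above can be written down as a quick sketch (this only encodes the observation from my testing; the factor of 2 is not a documented limit):

```python
# Empirical hang threshold observed above: nvstreammux works while the
# batch requires at most 2 frames per input, and hangs beyond that.
# The factor 2 is an observation from testing, not a documented constant.

def max_working_batch_size(num_inputs):
    return 2 * num_inputs

assert max_working_batch_size(1) == 2  # single filesrc: batch-size=2 ok, 4 hangs
assert max_working_batch_size(4) == 8  # four inputs: batch-size=8 ok, 9 hangs
```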

A suitable batched-push-timeout will guarantee the pipeline does not hang. It can be calculated with the following formula:

batch_size * 10^6 / (fps * the number of sources)
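As a sanity check, the formula can be evaluated directly. This is a minimal sketch; the 30 fps rate and source counts below are assumed example values, not figures from this thread:

```python
# Worked example of the batched-push-timeout formula quoted above:
#   batch_size * 10^6 / (fps * number_of_sources)
# giving a timeout in microseconds before nvstreammux pushes a partial batch.

def push_timeout_us(batch_size, fps, num_sources):
    """Timeout (microseconds) per the formula above, rounded down."""
    return batch_size * 10**6 // (fps * num_sources)

print(push_timeout_us(4, 30, 4))  # four 30 fps sources, batch-size=4 -> 33333
print(push_timeout_us(4, 30, 1))  # single 30 fps source, batch-size=4 -> 133333
```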

Please correct me if I misunderstood, but that formula is useful mainly for live or clock-synchronized sources, right?

# gst-launch-1.0 filesrc location=/opt/nvidia/deepstream/deepstream-5.0/samples/streams/sample_1080p_h264.mp4 ! decodebin ! mux.sink_0 nvstreammux name=mux batch-size=4 width=1920 height=1080 batched-push-timeout=1000000 ! fakesink

If I understand correctly in the above pipeline with batched-push-timeout of 1e6 we should never wait more than 1e6 us (1 second) for buffers to be sent downstream.

Since we’re reading a video from disk we would expect to always fill the 4-slot buffer before 1e6 us is finished so the timeout would not be relevant. However the above pipeline sends the first buffer and then seems to wait forever - considerably longer than 1 second.

Have you been able to reproduce this? I’ve tried hard to make this very quick to test.

You need to use the following command line for your case, so that the timestamp will be ignored.
gst-launch-1.0 filesrc location=/opt/nvidia/deepstream/deepstream-5.0/samples/streams/sample_1080p_h264.mp4 ! decodebin ! queue ! mux.sink_0 nvstreammux name=mux batch-size=4 width=1920 height=1080 batched-push-timeout=40000 live-source=1 ! fakesink

I’ve tested this, and while it allows us to set batch-size=4, it sends buffers with only a single frame inside. If you run this, you’ll see what I mean:

$ GST_DEBUG=nvstreammux:7 gst-launch-1.0 filesrc num-buffers=20 location=/opt/nvidia/deepstream/deepstream-5.0/samples/streams/sample_1080p_h264.mp4 ! decodebin ! queue ! mux.sink_0 nvstreammux name=mux batch-size=4 width=1920 height=1080 batched-push-timeout=40000 live-source=1 ! fakesink 2>&1 | grep 'buffer 0x'

Synchronizing the pipeline to the video clock (with live-source=1) is not what I want to do. It does however sidestep the bug I’m reporting because the short timeout sends single-frame buffers downstream.

If you remove live-source=1 and run it again you’ll see what I mean:

# GST_DEBUG=nvstreammux:7 gst-launch-1.0 filesrc num-buffers=20 location=/opt/nvidia/deepstream/deepstream-5.0/samples/streams/sample_1080p_h264.mp4 ! decodebin ! queue ! mux.sink_0 nvstreammux name=mux batch-size=4 width=1920 height=1080 batched-push-timeout=40000 ! fakesink 2>&1 | grep 'buffer 0x'

This command sends a single 4-frame buffer downstream and then hangs.


This is caused by the HW decoder and fakesink. The HW decoder has only 4 buffers in its bufferpool; when those four buffers are sent to fakesink as a whole batch, fakesink holds them as its last sample. So the decoder stops and waits for fakesink to release the buffers.

We need to let fakesink release the buffers. Please try the following pipeline:
gst-launch-1.0 filesrc num-buffers=20 location=/opt/nvidia/deepstream/deepstream-5.0/samples/streams/sample_1080p_h264.mp4 ! decodebin ! queue ! mux.sink_0 nvstreammux name=mux batch-size=4 width=1920 height=1080 batched-push-timeout=40000 ! fakesink enable-last-sample=0


Great job and thanks for working this out, the behaviour makes sense now. I’ve tested and I see the same outcome.

So this means that due to the limited HW decoder buffer pool, the maximum possible batch-size will be 4x the number of decoders so long as downstream elements have enable-last-sample=0 (and 2x the number of decoders if any element has enable-last-sample=1).
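That constraint can be sketched as a toy model. This is only a sketch: it hard-codes the 4-surface pool reported in this thread and treats a held last-sample as pinning one full batch while the next one is assembled, which reproduces the observed thresholds but is not DeepStream code:

```python
# Toy model of the bufferpool constraint discussed in this thread.
DECODER_POOL_SIZE = 4  # surfaces per HW decoder, per the explanation above

def hangs(batch_size_per_decoder, enable_last_sample):
    # With last-sample enabled, the sink pins one full batch while the
    # decoder/muxer still need surfaces for the next one, doubling demand.
    surfaces_needed = batch_size_per_decoder * (2 if enable_last_sample else 1)
    return surfaces_needed > DECODER_POOL_SIZE

assert not hangs(2, enable_last_sample=True)   # observed working above
assert hangs(4, enable_last_sample=True)       # observed hang above
assert not hangs(4, enable_last_sample=False)  # works with enable-last-sample=0
```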

I think this is clearly enough for live sources, but it is limiting for extremely high frame rate video or for the most efficient non-live processing. For downstream models, a batch size of 4 will usually result in low hardware utilization.

Thanks again for digging deeper and giving a detailed response. Is there any chance we could have a parameter exposed on the nvv4l2decoder element to change the number of decode buffers?

We can set more decoder buffers to make the pipeline work. If you use the following pipeline, it works, but system memory consumption will rise with each extra decoder buffer.
gst-launch-1.0 filesrc num-buffers=20 location=/opt/nvidia/deepstream/deepstream-5.0/samples/streams/sample_1080p_h264.mp4 ! qtdemux name=demux demux.video_0 ! h264parse ! nvv4l2decoder num-extra-surfaces=1 ! mux.sink_0 nvstreammux name=mux batch-size=4 width=1920 height=1080 batched-push-timeout=40000 ! fakesink