No FPS increase using tee and parallel inference on AGX

Please provide complete information as applicable to your setup.

• Hardware Platform: Jetson AGX
• DeepStream Version: 6.3
• JetPack Version: 5.1
• TensorRT Version: 8.5.2
• Issue Type: Question

I have built a pipeline using the tee element to run 3 YOLOv6 models in parallel on a Jetson AGX. I was expecting a decrease in inference time; however, I am getting the same time as if they were running sequentially. My question: is it expected to gain more FPS by using the tee element and running the models in parallel?

What is the whole media pipeline? Please refer to the parallel inference sample deepstream_parallel_inference_app.


Thanks for your reply. Kindly find the pipeline below.

Sorry for the late reply. Is this still a DeepStream issue to support?
In theory, parallel inference (tee + queue) is faster than sequential inference. How did you measure the FPS?
Why do you need to run the same model in parallel?

These are not the same models. I have 3 models and need to build a pipeline to run them all. As they are not dependent on each other, I wanted to run them in parallel.

I am using the GstNvDslogger element to measure the FPS.
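For reference, the sketch below is a simple buffer-counting probe I can use to cross-check the nvdslogger numbers. It assumes a standard GStreamer Python setup; "sink0" is a placeholder for the name of the sink element in my pipeline.

# A minimal sketch, assuming a standard GStreamer Python setup; "sink0" is a
# placeholder for the name of my sink element.
import time
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

_state = {"count": 0, "start": time.time()}

def fps_probe(pad, info, u_data):
    # Count every buffer reaching this pad and print an FPS estimate once per second.
    _state["count"] += 1
    elapsed = time.time() - _state["start"]
    if elapsed >= 1.0:
        print(f"~{_state['count'] / elapsed:.1f} fps at {pad.get_parent_element().get_name()}")
        _state["count"] = 0
        _state["start"] = time.time()
    return Gst.PadProbeReturn.OK

# sink_pad = pipeline.get_by_name("sink0").get_static_pad("sink")   # placeholder name
# sink_pad.add_probe(Gst.PadProbeType.BUFFER, fps_probe, 0)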

  1. Here are some reasons. There is a buffer pool in streammux. Making a batch in streammux is fast while inference is slow, and streammux will always wait until the buffer returns to the pool. When using tee, the batched data is not copied into each queue; it is just a reference, so the buffer only returns to streammux's buffer pool after all the models finish inference. The total time consumption is therefore similar to sequential inference. In the sample deepstream_parallel_inference_app above, the pipeline is designed as "streammux + tee + streamdemux + streammux": the first streammux will not wait, because the buffer is returned right after the second streammux.
  2. deepstream_parallel_inference_app will merge the metadata from the different branches. If there is no need to merge metadata, why not use three gst-launch command-lines in your application, each running one model? (A rough sketch of this idea follows below.)
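For illustration, a rough Python equivalent of the three-command-line idea might look like the sketch below. It is not the sample app; the stream path is the DeepStream sample media, and the nvinfer config-file paths are placeholders.

# A rough sketch: three fully independent pipelines, one per model, in a
# single Python process. The nvinfer config-file paths are placeholders.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

STREAM = "/opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.mp4"
CONFIGS = ["model1_config.txt", "model2_config.txt", "model3_config.txt"]  # placeholders

pipelines = []
for cfg in CONFIGS:
    desc = (
        f"filesrc location={STREAM} ! qtdemux ! h264parse ! nvv4l2decoder ! "
        f"m.sink_0 nvstreammux name=m batch-size=1 width=1280 height=720 ! "
        f"nvinfer config-file-path={cfg} ! fakesink"
    )
    p = Gst.parse_launch(desc)
    p.set_state(Gst.State.PLAYING)
    pipelines.append(p)

GLib.MainLoop().run()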

OK, I was wondering why "streammux + tee + streamdemux + streammux" is used in your pipeline; I thought it was just to select one stream for each branch. I will add these elements to my pipeline and benchmark the performance.

Regarding the second point about merging the metadata: I am implementing the pipeline in Python because I want to integrate it with another application, and I am going to use probes to get the metadata. Do you think merging the metadata will increase the FPS?
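For reference, the sketch below is roughly the kind of pad probe I plan to attach. The pyds calls are taken from the DeepStream Python bindings; the element name "pgie0" is a placeholder.

# Rough sketch of the pad probe I plan to use to read the metadata.
# Assumes the DeepStream Python bindings (pyds) are installed; "pgie0" is a placeholder name.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst
import pyds

def meta_probe(pad, info, u_data):
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK
    # Walk the batch meta attached by nvstreammux / nvinfer.
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        print(f"source {frame_meta.pad_index}: {frame_meta.num_obj_meta} objects")
        l_frame = l_frame.next
    return Gst.PadProbeReturn.OK

# pgie_src_pad = pipeline.get_by_name("pgie0").get_static_pad("src")   # placeholder name
# pgie_src_pad.add_probe(Gst.PadProbeType.BUFFER, meta_probe, 0)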

It serves both purposes: the first is selecting the sources for each branch; the second is enabling parallel inference. The branches do not share the batched data, because the second streammux creates a new buffer for each branch.

No, merging the metadata will not increase the FPS.

Hello @fanzh,

Thank you for your support. Following your advice, I have built the pipeline below by adding "streammux + tee + streamdemux + streammux"; however, I am getting 0 FPS in the logger. I do not know what is wrong. Could you please check the pipeline?

Can you narrow down this issue? For example:

  1. Add printing in a probe function to check which element did not output data (see the probe sketch after this list).
  2. If using tee with only one branch, can the app run well?
  3. You might dump deepstream_parallel_inference_app's pipeline graph to do some comparisons.
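For point 1, a rough sketch of such a debug probe might look like the following (it assumes a GStreamer Python setup; "pgie1" is a placeholder for whichever element you want to inspect):

# A rough sketch: attach a probe to the src pad of each suspect element and
# print whenever it pushes a buffer; an element that never prints is the one
# that stopped producing data.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

def trace_probe(pad, info, name):
    print(f"{name} pushed a buffer, pts={info.get_buffer().pts}")
    return Gst.PadProbeReturn.OK

# Static src pads are named "src"; for tee / nvstreamdemux use the request-pad
# name instead, e.g. "src_0".
# pad = pipeline.get_by_name("pgie1").get_static_pad("src")   # placeholder name
# pad.add_probe(Gst.PadProbeType.BUFFER, trace_probe, "pgie1")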

The pipeline runs when I create it with only one branch, as follows:

Streamdemux is causing the issue. When I remove it from the second branch, the pipeline runs with no issues, as follows:

Finally, I tried keeping streamdemux in branch 1 and removing it from branches 2 and 3, and it worked; however, it had no effect on the FPS. The FPS is still the same.

Is streamdemux an essential element for building a correct parallel pipeline, or does it have no effect as long as I add the second streammux?

There has been no update from you for a while, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

If there is no need to merge metadata, please refer to the following pipeline; it works well.

gst-launch-1.0 filesrc location=/opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! m.sink_0 \
nvstreammux name=m batch-size=2 width=1280 height=1280 ! queue ! nvstreamdemux name=demux0 \
demux0.src_0 ! tee name=srctee0 \
srctee0. ! queue ! m0.sink_0 nvstreammux name=m0 batch-size=1 width=1280 height=720 ! fakesink \
srctee0. ! queue ! m1.sink_0 nvstreammux name=m1 batch-size=1 width=1280 height=720 ! fakesink

The simplified pipeline is:

......! nvv4l2decoder ! streammux ! streamdemux ! tee ! streammux ! fakesink
                                                      ! streammux ! fakesink
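Since the pipeline in this thread is being built from Python, the same gst-launch line above can be wrapped with Gst.parse_launch. Below is a rough sketch; the nvinfer elements (with placeholder config paths, shown only in comments) would replace the fakesinks.

# A rough sketch wrapping the gst-launch pipeline above with Gst.parse_launch.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

desc = (
    "filesrc location=/opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.mp4 ! "
    "qtdemux ! h264parse ! nvv4l2decoder ! m.sink_0 "
    "nvstreammux name=m batch-size=2 width=1280 height=1280 ! queue ! "
    "nvstreamdemux name=demux0 "
    "demux0.src_0 ! tee name=srctee0 "
    "srctee0. ! queue ! m0.sink_0 "
    "nvstreammux name=m0 batch-size=1 width=1280 height=720 ! fakesink "   # e.g. nvinfer config-file-path=model1.txt ! fakesink
    "srctee0. ! queue ! m1.sink_0 "
    "nvstreammux name=m1 batch-size=1 width=1280 height=720 ! fakesink"    # e.g. nvinfer config-file-path=model2.txt ! fakesink
)

pipeline = Gst.parse_launch(desc)
pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()   # stop with Ctrl+C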

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.