Parallel Inference vs Tee

• Hardware Platform (Jetson / GPU): GPU
• DeepStream Version: 7.0 (docker image: nvcr.io/nvidia/deepstream:7.0-triton-multiarch)
• NVIDIA GPU Driver Version (valid for GPU only): 535.171.04

Hi,

I’m trying to understand the differences between two pipeline structures in DeepStream. Specifically, I’m comparing a pipeline similar to the one used in the deepstream_parallel_inference_app:

...... ! streammux ! nvinfer ! nvtracker ! streamdemux ! tee ! streammux ! nvinfer ! fakesink
                                                             ! streammux ! nvinfer ! fakesink

with a simpler version that uses only the tee element:

...... ! streammux ! nvinfer ! nvtracker ! tee ! nvinfer ! fakesink
                                               ! nvinfer ! fakesink

I’ve already read through these forum discussions: “No increase using tee and parallel inference on AGX” and “Parallel branching in DeepStream 6.4”.

From what I understand, in the parallel inference case, the buffer is copied between different branches, allowing each branch to work on its own copy. In contrast, with the tee approach, the buffer is shared across all branches.

However, I’m still unclear on the benefits of parallel inference.
What does it mean when we say that models run in parallel in the context of parallel inference?
In which structure (parallel inference vs. tee) can each model operate at its own frame rate, depending on its inference speed?

Any insights would be greatly appreciated!

Thanks.


No. That pipeline is not similar to deepstream_parallel_inference_app. The deepstream_parallel_inference_app pipeline is shown in deepstream_parallel_inference_app/common.png at master · NVIDIA-AI-IOT/deepstream_parallel_inference_app (github.com).

No. deepstream_parallel_inference_app shares the batched buffer between the branches and reorganizes new batches inside each branch. Your pipeline with “tee” simply shares the same batched buffer across all branches.

If you are talking about deepstream_parallel_inference_app, one benefit is that it avoids extra buffer copies while sharing the buffer between branches. Another benefit is that the duplicated batch metadata from the branches is converged by “metamux”.

Regarding inference speed, live sources and local file sources are different cases.
For local file sources, if you disable “sync” on the sink, the inference speed is determined by the slowest branch (model) in the pipeline.
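
As a quick illustration, here is a minimal sketch, assuming a Python GStreamer/DeepStream app in which the branch sinks have known names (the sink names below are placeholders):

def disable_sink_sync(pipeline):
    # With sync=False the sinks stop pacing playback to the file's timestamps,
    # so a local-file pipeline runs as fast as its slowest branch allows.
    for name in ("sink_branch1", "sink_branch2"):  # placeholder element names
        sink = pipeline.get_by_name(name)
        if sink is not None:
            sink.set_property("sync", False)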

For live sources, the inference speed depends on many factors: the input sources’ frame rates, the models’ inference times, and the pipeline’s output speed all affect the final throughput.

Thank you for the response.

There are still a few aspects I’m unclear about.

Below is a graph I generated from the deepstream_parallel_inference_app using the configuration file: configs/apps/bodypose_yolo_lpr/source4_1080p_dec_parallel_infer.yml.

parallel_inference_app_pipeline_graph.zip (1.8 MB)

I noticed some differences between this graph and the image shown on deepstream_parallel_inference_app/common.png at master · NVIDIA-AI-IOT/deepstream_parallel_inference_app (github.com). In the GitHub image, it appears that there’s only one tee, followed by a streamdemux and streammux for each branch. However, in the graph I generated, there is only one streamdemux, a separate tee for each source, and a streammux for each branch.
Why is there this discrepancy? The image on GitHub doesn’t seem to accurately reflect how the pipeline is actually structured.

If I understand correctly, in both cases (whether using a simple tee or the approach in deepstream_parallel_inference_app) the buffer is not copied; it is the same and shared across all branches. One key difference, however, is that with the deepstream_parallel_inference_app approach, you can select which sources are processed in each branch, which isn’t possible with the tee approach. Is that correct?

Another difference, I believe, is that with the tee approach, there is a single batch meta shared between branches, whereas in deepstream_parallel_inference_app, each branch has its own batch meta copy, necessitating the use of a metamux at the end to aggregate them. Is this accurate?

Are there any other differences between these two approaches?

My goal is to have multiple models running in parallel. Ideally, I want each branch to process the video stream at its own frame rate, depending on the speed of the models involved. For example, suppose I have an RTSP video input at 30 fps and a parallel pipeline with two branches: Branch1 with a smaller model and Branch2 with a larger model. I would like Branch1 to maintain 30 fps, while Branch2, if it can’t keep up, reduces its rate (e.g., to 15 fps) without affecting Branch1’s frame rate.
Specifically, I need a pipeline that starts with a primary detection model, followed by a split into multiple branches as described above. I do not need to aggregate the metadata from each branch.

Is it possible to achieve this? Which approach would you recommend?

Thanks.

The picture is from an older version; the graph of the actual pipeline may be slightly different.

Yes.

Yes.

Yes.

No.

No. The frame rate is determined by the timestamps of the video (whether it is a local video file or a live stream); you can’t control the frame rate unless you have your own plugin to change the timestamps.

For this case, they are effectively two separate pipelines because the sources are different. Once you change the video’s frame rate (this should be done by a plugin that can rewrite timestamps and insert/drop frames, e.g. videorate), the video is a different video. The change must happen before nvstreammux (which generates the batched data for inferencing), because nvstreammux is sensitive to timestamps and the timestamps of the batched frames can’t be changed.
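
To make that concrete, here is a hedged sketch of placing videorate in front of nvstreammux; the URI, caps, resolution, and config file are placeholders, and depending on your decoder you may need different conversions around videorate:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# videorate is a standard GStreamer element that drops/duplicates frames and rewrites
# timestamps; putting it (with a capsfilter) before nvstreammux means the muxer only
# ever sees the already re-timed stream. If NVMM caps do not negotiate with videorate
# in your setup, place it on system-memory caps instead.
pipeline = Gst.parse_launch(
    "uridecodebin uri=rtsp://camera/stream ! nvvideoconvert ! "
    "videorate drop-only=true ! video/x-raw(memory:NVMM),framerate=15/1 ! "
    "m.sink_0 nvstreammux name=m batch-size=1 width=1920 height=1080 ! "
    "nvinfer config-file-path=pgie_config.txt ! fakesink sync=false"
)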

gst-nvinfer supports skipping inference on some batches through the “interval” parameter (see the Gst-nvinfer page in the DeepStream 6.4 documentation). This does not change the video frame rate, but the inferencing load is reduced and kept under control. If this is acceptable to you, deepstream_parallel_inference_app can be used for your case.
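
For example, a minimal sketch of giving each branch’s nvinfer its own “interval” (“interval” is the documented gst-nvinfer property; the element names are placeholders):

def set_branch_intervals(pipeline):
    # interval = number of consecutive batches skipped after each inferred batch.
    # With a ~30 fps source: interval=0 infers on every batch, interval=1 on roughly
    # every 2nd batch (~15/s), interval=2 on roughly every 3rd batch (~10/s).
    pipeline.get_by_name("sgie_small").set_property("interval", 0)  # light model: every batch
    pipeline.get_by_name("sgie_large").set_property("interval", 2)  # heavy model: every 3rd batch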

Thanks for the detailed response; I now clearly understand the differences between the tee approach and the parallel_inference_app approach.

To clarify my goal from my previous post: I want to build a pipeline that starts with a primary inference Detector and Tracker, and then splits into several branches, each with a different inference model. Here’s an example of the structure:

...... ! streammux ! queue ! nvinfer ! queue ! nvtracker ! tee ! queue ! nvinfer ! fakesink
                                                               ! queue ! nvinfer ! fakesink

In this example, let’s say I have two branches: Branch 1 with a smaller Model 1 and Branch 2 with a larger Model 2. I want each model to operate at its maximum possible speed.

Assume the RTSP video input is at 30 FPS, and the primary detector is fast enough to process the live video in real time without dropping any batches (with an inference time of less than 1/30th of a second).

I would like Branch 1 and Branch 2 to dynamically adjust how many batches they drop based on the inference times of Model 1 and Model 2. For instance, if Model 1 has an inference time of approximately 1/20th of a second and Model 2 takes around 1/10th of a second, I want Branch 1 to drop 1 out of every 3 batches and Branch 2 to drop 2 out of every 3 batches. This would result in Model 1 processing 20 batches per second and Model 2 processing 10 batches per second.

I know this can be achieved using the interval parameter in Gst-nvinfer, but I’m wondering if there is a way to achieve this dynamically, perhaps by configuring the queue parameters before each nvinfer.

Currently, I have set leaky=2 and max-size-buffers=1 for the queues.
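
For reference, one branch of the configuration I’m describing looks roughly like this (a sketch of the gst-launch-style string I pass to Gst.parse_launch; element names and config paths are placeholders):

# leaky=downstream (2) with max-size-buffers=1 keeps only the newest buffer in the
# queue and drops older ones while the downstream nvinfer is still busy.
branch = (
    "t. ! queue leaky=downstream max-size-buffers=1 max-size-bytes=0 max-size-time=0 "
    "! nvinfer name=sgie_branch1 config-file-path=model1_config.txt ! fakesink sync=false"
)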

The core idea is that each model should always perform inference on the most recent available batch, dropping all older batches.

Is it possible to achieve this?

Thanks.

DeepStream provides no plugin or function to drop batches in the way you describe. Only gst-nvinfer can skip some batches; no batch is actually dropped.

Yes, the “interval” property can be set dynamically while the pipeline is in the “PLAYING” state.
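
For example (a sketch, assuming a Python app; the element name and the 5-second policy are placeholders, and the adaptation logic is up to the application):

from gi.repository import GLib

def relax_heavy_branch(pipeline):
    # Increase "interval" on the heavier branch at runtime so it infers less often.
    sgie = pipeline.get_by_name("sgie_large")          # placeholder element name
    sgie.set_property("interval", sgie.get_property("interval") + 1)
    return True                                        # keep the GLib timeout alive

# Call periodically while the pipeline is PLAYING, e.g.:
# GLib.timeout_add_seconds(5, relax_heavy_branch, pipeline)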

It may work, but it is risky for controlling the final output. Especially with live sources, the batches may be organized in different ways, and a batch’s timestamp does not equal the frames’ timestamps, so it is hard to decide which batches should be dropped. Changing the batches is not recommended.

I’m sorry, but I’m not sure I understand what you mean here. Could you please clarify?

Additionally, when I talked about “dropping” batches earlier, I didn’t necessarily mean dropping them; simply skipping them would be sufficient.

And regarding the “interval” property: I don’t want to set it dynamically while the pipeline is in the “PLAYING” state.
Ideally, I’m looking for a solution, whether it’s a plugin before nvinfer or something else, that can skip over batches when nvinfer is busy with an inference process. The goal is for nvinfer to process only the most recent available batch once it’s free, skipping any older ones.
By “dynamic skipping,” I mean skipping batches only when necessary (when nvinfer is busy) rather than specifying a fixed number of consecutive batches to be skipped using the interval parameter.
Is this possible?

Thanks!

When you do something to the GstBuffer (which is the batch after nvstreammux) in one branch, the same thing happens in the other branches. Since your purpose seems to be to do different things in different branches, this is not recommended.

The “interval” parameter can help with skipping.

I understand your purpose.

The “queue” may work, but note that a leaky “queue” drops buffers; it does not skip them.
