Multi-stream DeepStream 9.0 app

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) - RTX A100 80GB
• DeepStream Version 9.0
• TensorRT Version 10.16.1
• NVIDIA GPU Driver Version (valid for GPU only) 580.126.20
• Issue Type( questions, new requirements, bugs) - Clarification on the best batch size selection.
Pipeline is roughly:

uridecodebin/rtspsrc → nvstreammux → nvinfer → nvtracker → nvvideoconvert → fakesink

Model: custom YOLO exported to TensorRT FP16
Input: 640x640
Output: [batch, 300, 6]

My goal is to keep GPU utilization stable below ~90%, not only maximize average throughput.

I tested two approaches:

  1. streammux batch-size=80, nvinfer batch-size=80, FP16 TensorRT engine with max batch 80
  2. streammux batch-size=32, nvinfer batch-size=32, FP16 TensorRT engine with max batch 32, while still connecting 80 RTSP sources

With batch 80, average throughput is good, but I see sudden GPU SM spikes. With batch 32 and interval=2, the runtime is much more stable in my tests.

My question:

For 80 live RTSP sources, is it generally better to build and use a batch-80 engine and let nvinfer process one large batch, or to use a smaller batch-32 engine and let DeepStream/nvinfer process the 80 sources in smaller chunks? Is there a risk that the pipeline will “fall behind”, i.e. that inference won't keep up with the decoded frames?
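
A rough way to reason about the “fall behind” risk is to compare the GPU time inference needs per second with the one second available. The sketch below is only back-of-the-envelope arithmetic, not a DeepStream API: `inference_load` and all numbers in the example (25 fps, 12 ms or 15 ms per batch) are hypothetical, and it assumes that nvinfer's `interval` skips that many consecutive batches, so only one batch in `interval + 1` is inferred. It ignores the tracker, decoding and any overlap between stages.

```python
import math


def inference_load(num_sources, fps, batch_size, batch_latency_ms, interval=0):
    """Estimate the fraction of each second spent on inference.

    All inputs are assumptions supplied by the caller; batch_latency_ms
    should come from your own measurement of one inference pass.
    """
    # Batches needed to cover one frame from every source.
    batches_per_round = math.ceil(num_sources / batch_size)
    # Batches formed per second if every source delivers `fps` frames.
    batches_per_sec = batches_per_round * fps
    # Assumption: `interval` skips that many consecutive batches,
    # so only 1 in (interval + 1) batches is actually inferred.
    inferred_per_sec = batches_per_sec / (interval + 1)
    # Busy milliseconds per second, converted to a fraction of one second.
    return inferred_per_sec * batch_latency_ms / 1000.0


def keeps_up(load):
    """Inference keeps up only if it needs less than the full second."""
    return load < 1.0


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    for latency_ms, interval in [(12.0, 0), (15.0, 0), (15.0, 2)]:
        load = inference_load(80, 25, 32, latency_ms, interval)
        print(latency_ms, interval, round(load, 3), keeps_up(load))
```

If the estimated load is at or above 1.0, frames would queue up or be dropped somewhere upstream; a value well below 1.0 leaves headroom for the tracker and other GPU work.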

What are the practical advantages/disadvantages of each approach in DeepStream?

Specifically, I want to understand:

  • Does nvinfer internally split larger nvstreammux batches into smaller inference chunks when nvinfer batch-size is smaller than the number of sources?
  • Is using streammux batch-size=32 with 80 live sources a recommended/valid approach?
  • Can smaller nvinfer batches reduce GPU utilization spikes even if average utilization is similar?
  • Are there latency or frame-dropping side effects when streammux batch-size is smaller than the number of live sources?
  • For live RTSP surveillance, should batch size be optimized for throughput, latency, or GPU utilization stability?

Any guidance or best practices for 80-camera DeepStream deployments would be appreciated.

Can you explain what “average throughput” means? Does “the runtime is much more stable” mean that the GPU usage is smooth?
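
To make “stable” concrete, one option is to summarize sampled GPU utilization by its mean, peak and spread rather than by the mean alone. The following is a minimal sketch under assumptions: `summarize_utilization` is a hypothetical helper, and the two sample lists are made-up numbers chosen only to show that two runs can share the same average while one of them spikes above 90%.

```python
import statistics


def summarize_utilization(samples, limit=90):
    """Summarize GPU utilization samples given in percent."""
    return {
        "mean": statistics.mean(samples),
        "peak": max(samples),
        # Population standard deviation as a simple measure of spread.
        "stdev": statistics.pstdev(samples),
        # Share of samples above the limit you want to stay under.
        "over_limit": sum(1 for s in samples if s > limit) / len(samples),
    }


if __name__ == "__main__":
    # Made-up samples, for illustration only: same mean, different spikes.
    smooth = [70, 72, 68, 71, 69, 70]
    spiky = [55, 98, 52, 97, 60, 58]
    print(summarize_utilization(smooth))
    print(summarize_utilization(spiky))
```

Reporting the peak and the share of samples over the limit alongside the average makes it clear whether the goal is throughput or spike-free utilization.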

From the model’s point of view, whether running a batch-80 engine once is better than running a batch-32 engine three times depends on the model itself.

From the whole pipeline’s point of view, if your concern is GPU usage, the model is not the only component that uses the GPU. The tracker, and sometimes the postprocessing if it is implemented with CUDA, also use the GPU. They work in parallel, so the total GPU usage cannot simply be calculated.

nvinfer internally splits larger nvstreammux batches into smaller inference chunks. In your test you set the nvstreammux batch size equal to the nvinfer batch size, so nvinfer handles each nvstreammux batch in one pass.
In your test case the split is done by nvstreammux: it composes batches of 32 from your 80 input streams. If your live streams are stable enough and the nvstreammux parameters are set properly, 3 batches may be generated for your 80 streams each time.
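
The “3 batches for 80 streams” figure is just the idealized arithmetic of filling batches of 32 until all 80 sources are covered. A minimal sketch of that arithmetic follows; `split_into_batches` is a hypothetical helper, not part of DeepStream, and real batch composition depends on frame arrival times and the nvstreammux settings, so actual batches may be filled differently.

```python
def split_into_batches(num_sources, batch_size):
    """Return the idealized batch sizes needed to cover every source once."""
    sizes = []
    remaining = num_sources
    while remaining > 0:
        # Fill a batch completely if possible, else take what is left.
        take = min(batch_size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


if __name__ == "__main__":
    print(split_into_batches(80, 32))  # [32, 32, 16]
    print(split_into_batches(80, 80))  # [80]
```

Note that in the batch-32 case the last batch is only half full, so the engine runs three times per round of frames but the third run does less useful work than the first two.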

No.

No. It also depends on the model itself and on the objects detected, since you have a tracker in your pipeline.

Yes.

What does “throughput” mean here? Which latency do you mean in this sentence: the network latency, the inference latency, or something else?