How many RTSP streams can a rack-server CPU handle?

Please provide complete information as applicable to your setup.

• Hardware Platform: GPU
• DeepStream Version: 8.0
• JetPack Version (valid for Jetson only): Not applicable (x86 server)
• TensorRT Version: Latest
• NVIDIA GPU Driver Version (valid for GPU only): Latest
• Issue Type( questions, new requirements, bugs): Question
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

🟢 [Question] How to reduce CPU bottleneck when handling RTSP streams before DeepStream decoding?

Hi everyone,

I’m currently working on an AI surveillance project.
My current hardware setup includes:

  • 1× Rack Server

    • CPU: 2 × Intel® Xeon® Silver 4510 (12C/24T, 2.4GHz Base / 4.1GHz Boost)

    • RAM: 128GB DDR5 (2 × 64GB)

    • GPU: NVIDIA RTX A6000 (48GB)

I’m receiving RTSP streams from 7.4MP IP cameras and plan to run video analytics (YOLO, etc.) using NVIDIA DeepStream for decoding + inference on the GPU.

However, my engineer mentioned that the CPU can only handle ingesting up to 6 RTSP streams before they are handed off to DeepStream for GPU decoding.

This seems like an unusual bottleneck. Given that decoding is being handled by the GPU via NVDEC, I was expecting the CPU load to be minimal, and to be able to scale to dozens of streams with this setup.
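For reference, the single-stream ingestion path I have in mind looks roughly like the sketch below. This is only a sketch, assuming an H.264 RTSP source; the element names come from the DeepStream reference apps, and the camera URL and inference config path are placeholders:

```shell
# Single-stream sketch of an RTSP -> NVDEC -> inference pipeline.
# Assumes H.264; camera URL and YOLO config path are placeholders.
gst-launch-1.0 \
  rtspsrc location="rtsp://<camera-ip>/stream" latency=200 ! \
  rtph264depay ! h264parse ! \
  nvv4l2decoder ! \
  mux.sink_0 nvstreammux name=mux batch-size=1 width=1920 height=1080 ! \
  nvinfer config-file-path=<yolo_config.txt> ! \
  fakesink
```

Only the depay/parse steps before `nvv4l2decoder` run on the CPU here; decoding, batching, and inference stay on the GPU.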


❓ My Questions:

  1. Has anyone experienced a similar CPU bottleneck during RTSP ingestion before it reaches DeepStream/NVDEC?

  2. Are there recommended ways to offload RTSP parsing / buffering / demuxing to GPU or another optimized path?

  3. Is there a better way to architect the pipeline to let GPU handle more of the preprocessing (RTSP input → decode → inference), without overloading CPU?

  4. If we want to scale up to 100 camera streams, do we really need to buy ~15+ servers, or is there a more scalable architecture?


🎯 Goal:

I want to maximize GPU utilization and minimize CPU dependency, especially during the RTSP ingestion and pre-decode steps, so that each rack server can handle 30–50 camera streams instead of just 6, with an end-to-end FPS of at least 25 for every stream.

Any insights or best practices would be highly appreciated. Thank you!

How did the engineer get this data? Have you measured the YOLO model engine’s performance with 6 streams?

The RTSP protocol stack and the network stack also run on the CPU. Why do you think the bottleneck is caused by the parsing/buffering/demuxing?

Decoding and inference are done on the GPU. The video data before decoding is just compressed data, much smaller than the decoded raw frames. It is not practical to do the network protocol handling on the GPU (the data arrives as small packets), so it is fine to do the RTSP elementary-stream parsing on the CPU.
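As a rough illustration of why the pre-decode data volume is small: the sketch below compares the compressed bitrate against the decoded raw-frame data rate. The camera geometry and the 8 Mbps H.264 bitrate are assumptions for illustration, not figures from your setup:

```python
# Rough data-rate comparison: compressed RTSP stream vs. decoded raw frames.
# All camera parameters below are assumptions for illustration only.
width_px = 3264          # hypothetical geometry for a ~7.4MP camera
height_px = 2448
fps = 25
compressed_mbps = 8.0    # assumed H.264 bitrate

# Decoded NV12 frames use 1.5 bytes per pixel (YUV 4:2:0 subsampling).
raw_bytes_per_sec = width_px * height_px * 1.5 * fps
raw_mbps = raw_bytes_per_sec * 8 / 1e6

print(f"compressed:  {compressed_mbps:.0f} Mbit/s")
print(f"decoded raw: {raw_mbps:.0f} Mbit/s")
print(f"ratio:       {raw_mbps / compressed_mbps:.0f}x")
```

Under these assumptions the raw frames are a few hundred times larger than the compressed stream, which is why the CPU-side protocol handling before NVDEC should be comparatively cheap.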

Can you tell us your video streams’ format, resolution, framerate and bitrate? According to the A6000’s hardware video decoder capability, it can only support decoding up to 37 1080p@30fps H.264 streams. There is no point in mentioning only how many streams you want to handle.
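As a back-of-the-envelope illustration, you can scale the published 1080p figure by pixel throughput. This treats NVDEC throughput as linear in pixels per second, which is a simplification, and it assumes H.264 at the 7.4MP@25fps camera spec mentioned in the question:

```python
# Estimate NVDEC stream capacity by scaling the published 1080p figure.
# 37x 1080p@30fps H.264 is the decode support matrix number for A6000;
# 7.4MP@25fps is the camera spec stated in the question.
baseline_streams = 37
baseline_px_per_sec = 1920 * 1080 * 30        # one 1080p@30fps stream
total_px_per_sec = baseline_streams * baseline_px_per_sec

target_px_per_sec = 7_400_000 * 25            # one 7.4MP@25fps stream
est_streams = total_px_per_sec // target_px_per_sec
print(est_streams)
```

So even before any CPU consideration, a single A6000 would top out at roughly a dozen full-resolution 7.4MP streams in this simplified model; decoding a downscaled substream or using a more efficient codec would change the numbers.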

Video Encode and Decode Support Matrix | NVIDIA Developer

Video Codec SDK | NVIDIA Developer

There is no update from you for a period, assuming this is not an issue anymore. Hence we are closing this topic. If need further support, please open a new one. Thanks.