Please provide complete information as applicable to your setup.
• Hardware Platform (Jetson / GPU) GPU
• DeepStream Version 6.2
• TensorRT Version 22.214.171.124
• NVIDIA GPU Driver Version (valid for GPU only) 525.125.06
• Issue Type (questions, new requirements, bugs) Question
I have a question regarding source decoding in DeepStream. I have 4K videos (3840x2160) and a 4GB GPU (GTX 1650). I can launch up to 8 sources at 4K resolution, but adding more causes DeepStream to crash due to insufficient memory. In addition, even though I am able to launch 8 sources, this is with no PGIE or SGIE in the pipeline, only sources and streammux. The performance of the pipeline is just 18 FPS.
- Why does decoding use all the GPU memory, leaving no room for more sources and only very limited room for other plugins?
- How come, in retail, hundreds of streams (e.g. camera monitoring) are decoded at once using only the CPU? Are there any recommendations to reduce decoding resource usage, like using ffmpeg to cut an ROI from the stream, create a new stream, and pass it to DeepStream?
Why does decoding use all the GPU memory, leaving no room for more sources and only very limited room for other plugins?
GPU-accelerated decoders require the buffer to be in GPU memory; this is simply how they work. If you find that memory is the limiting factor rather than processing power, you can switch to CPU decoders such as 'avdec_h264'. However, this will be slower, and you mentioned that you are already running the pipeline at only 18 fps with the HW decoders.
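To get a feel for why 4K decoding eats GPU memory so quickly, here is a rough back-of-envelope sketch. The 16-surface pool size per decoder is an assumption for illustration; the actual number allocated depends on the stream's DPB size and the decoder's extra-surface settings.

```python
# Rough estimate of GPU memory used by decoded 4K NV12 surfaces.
# The pool size per decoder (16 surfaces) is an assumption; the real
# count depends on the stream's reference-frame (DPB) requirements.
WIDTH, HEIGHT = 3840, 2160
BYTES_PER_PIXEL_NV12 = 1.5  # 8-bit luma plane + half-resolution chroma plane

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL_NV12  # ~12.4 MB per frame

def pool_mib(num_surfaces: int) -> float:
    """GPU memory for one decoder's surface pool, in MiB."""
    return num_surfaces * frame_bytes / (1024 * 1024)

for sources in (1, 8):
    total = sources * pool_mib(16)
    print(f"{sources} source(s): ~{total:.0f} MiB just for decode surfaces")
```

With 8 sources this alone approaches 1.5 GiB, before the streammux batch buffers, inference engines, or any other plugin allocate anything, which is consistent with a 4GB card running out of headroom.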
How come, in retail, hundreds of streams (e.g. camera monitoring) are decoded at once using only the CPU? Are there any recommendations to reduce decoding resource usage, like using ffmpeg to cut an ROI from the stream, create a new stream, and pass it to DeepStream?
We have an inference server based on the T4, which is a significantly larger board. We have limited it to 32 streams of 1080p@30fps to avoid running out of memory. I’m not aware of any hardware that can support hundreds of 4K streams, except by using multiple instances on a cloud computing service. Deep learning models are typically trained with small images, and the preprocessing usually involves significant downscaling of the inputs. I would recommend using a lower resolution as the input; modern RTSP cameras often provide streams in various resolutions. You can then upscale the detections and apply them to the original 4K stream if needed.
You can also define an ROI, but that doesn't help the decoder; it still takes a significant amount of time to copy the buffer to the GPU and decode it, because you need a decoded buffer before you can extract an ROI from it. It might be more effective to check whether you can define an ROI on the producer side of the RTSP stream.
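The low-resolution-inference approach described above can be sketched as a small helper: run detection on the camera's low-resolution substream and scale the boxes back onto the 4K stream. The function name, box format, and 720p substream size are illustrative assumptions, not DeepStream API.

```python
# Hypothetical helper: map detection boxes from a low-resolution
# inference stream (e.g. the camera's 1280x720 substream) back onto
# the original 3840x2160 stream. Box format (x, y, w, h) is assumed.
def upscale_box(box, src_size=(1280, 720), dst_size=(3840, 2160)):
    """Scale an (x, y, w, h) box from src_size coordinates to dst_size."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    x, y, w, h = box
    return (x * sx, y * sy, w * sx, h * sy)

# A detection at (100, 50, 200, 100) on the 720p stream maps to
# (300, 150, 600, 300) on the 4K stream (3x scale on both axes).
print(upscale_box((100, 50, 200, 100)))
```

Since the model's preprocessing would have downscaled a 4K input anyway, running inference on the substream typically costs little or no accuracy while cutting the decode load by an order of magnitude.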
The decoder limitation is listed in Video Codec SDK | NVIDIA Developer.
The GTX 1650 is a consumer card with less than half of the T4's decoding capability. For 3840x2160@30fps H.264 videos, the limit may be about 4.2 streams' worth of throughput, i.e. at most 4 concurrent streams.
Please choose proper GPU product according to your requirement.
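The "4.2 streams" figure above is simple throughput arithmetic. As a hedged sketch: if the GTX 1650's NVDEC sustains roughly 126 fps of aggregate 4K H.264 decode (an assumed figure in the spirit of the Video Codec SDK performance tables; actual throughput varies with clocks and content), then:

```python
# Back-of-envelope check of the "about 4.2 streams" figure.
# AGGREGATE_4K_H264_FPS is an assumption, not a measured spec.
AGGREGATE_4K_H264_FPS = 126  # assumed total 4K H.264 decode throughput
STREAM_FPS = 30              # per-stream frame rate

max_streams = AGGREGATE_4K_H264_FPS / STREAM_FPS
print(f"~{max_streams:.1f} concurrent 4K@30 streams")
```

This also explains the 18 FPS observation with 8 sources: 8 streams must share a decoder that can only feed roughly 4 of them at full rate.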
Thank you for the insights, it was useful.
I am still trying to understand the difference in hardware utilization when using 1 vs multiple sources.
For example, this is the output of nvidia-smi dmon when using 1 source at 4K (OD batch size 8, since I am using 8 ROIs per source) (note: DeepStream was launched midway through the logs):
Using 6 sources at 4K (OD batch size 48):
Why does the utilization of both the memory and the decoder drop when using more sources? Why can't the decoder be utilized more fully? Wouldn't that increase the throughput (FPS) of the pipeline?
There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
What is your pipeline? What are the parameters you set?
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.