Performance drop when using multiple sources

Please provide complete information as applicable to your setup.
• Hardware Platform (Jetson / GPU): Jetson Xavier AGX
• DeepStream Version: 6.3.0
• JetPack Version (valid for Jetson only): 5.1
• TensorRT Version: 8.5.2.2

I know the AGX Xavier supports 52x 1080p30 (H.265) decode, and with deepstream-test3-app you can run several sources and hardly see any FPS drop. But in my script there is some kind of bottleneck, or even some synchronous behavior, that is reducing the FPS by almost 20% for every source that is added.

My source is an RTSP stream at 1920x1080@25fps. Using 1 uridecodebin as a source bin, I get 24.2 fps. With 2 source bins I get 18.2 fps, and with 3 source bins I get 13.4 fps. It keeps decreasing as I add more source bins, and it shouldn't, because the AGX Xavier is capable of sustaining 24-25 fps for many inputs like that. So I am missing something.

My pipeline goes like this:

uridecodebin [1..N] -> nvstreammux -> nvinfer -> nvtracker -> nvtee -> nvstreamdemux -> nvvideoconv[1..N] -> nvosd[1..N]

I read some posts about queues and how they can provide asynchronous behavior, so I added a queue between every element, like this:

uridecodebin [1..N] -> nvstreammux -> queue1 -> nvinfer -> queue2 -> nvtracker -> queue3 -> nvtee -> queue4 -> nvstreamdemux -> queue_demux_conv[1..N] -> nvvideoconv[1..N] -> queue_conv_osd[1..N] -> nvosd[1..N]
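
Each queue was created and linked in the usual way; a minimal sketch of one stretch of that chain (element variable names illustrative, following the deepstream_test_3.py style):

    queue1 = Gst.ElementFactory.make("queue", "queue1")
    queue2 = Gst.ElementFactory.make("queue", "queue2")
    pipeline.add(queue1)
    pipeline.add(queue2)

    # streammux -> queue1 -> pgie -> queue2 -> tracker
    streammux.link(queue1)
    queue1.link(pgie)
    pgie.link(queue2)
    queue2.link(tracker)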

If the missing queues were the issue, adding them should have helped, but it made no difference to the decaying performance, so there must be something else going on. I therefore removed the queues again and am back to the original pipeline. Can anyone give me a tip about what's wrong? This is my script:
flavio_forum.py.txt (22.5 KB)

Thanks in advance.

Why do you comment out pgie.set_property("batch-size", number_sources)? If you are using multiple sources, the engine batch-size should be updated accordingly.
Please refer to deepstream_test_3.py, which has a similar media pipeline.
Please refer to this topic for performance improvement.

batch-size=1 because of the error "Backend has maxBatchSize 1 whereas 3 has been requested". I am using YOLOv5: I downloaded yolov5s.pt and converted it to .onnx, and that ONNX file is the one being loaded.

If pgie.set_property("batch-size", number_sources) is commented out, nvinfer uses a fixed batch-size 1 engine, which is not reasonable. With multiple sources, nvinfer needs a batch-size > 1 engine for higher performance.
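
In deepstream_test_3.py this is done by setting the same batch size on both nvstreammux and nvinfer; a minimal sketch of that part of the sample (variable names as used there):

    streammux.set_property("batch-size", number_sources)

    # override the batch-size from the config file if it does not match the source count
    pgie_batch_size = pgie.get_property("batch-size")
    if pgie_batch_size != number_sources:
        pgie.set_property("batch-size", number_sources)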

This is clear to me, but my point is the error ("Backend has maxBatchSize 1 whereas 3 has been requested"). The export from .pt to .onnx was done with --dynamic, which means the model should support a dynamic batch size. So when I set batch-size=3, for instance, the pipeline still gives me that error, even though the model should support it.
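
The export was done roughly like this (assuming the standard export.py from the ultralytics/yolov5 repository; exact flags may vary between versions):

    python3 export.py --weights yolov5s.pt --include onnx --dynamic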

Could you share a whole log and the nvinfer configuration file?

I managed to export the .onnx with dynamic axes, and batch-size in the config can now be set to any number >= 1 with no errors. So I am back to the original problem, the performance drop. Using 3 RTSP streams at 1920x1080@25fps, with pgie and streammux batch-size=3, I still get 16 fps per stream. It should be 24-25 fps, and the value drops further whenever I increase the number of sources.
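
The [property] section of the nvinfer config now looks roughly like this (only the relevant keys, paths simplified; the full file is attached below):

    [property]
    onnx-file=yolov5s.onnx
    batch-size=3
    network-mode=0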

This is the config
config_infer_primary_yoloV5.txt (984 Bytes)

And this is the log
run.log.tar.gz (45.7 MB)

  1. Please refer to the code of deepstream-test3, and make sure the batch-size of nvstreammux and nvinfer is the same as number_sources.
  2. If you want high fps, you can use fakesink (see the sketch after this list).
  3. Noticing that with deepstream-test3 you can get high fps, can you modify deepstream-test3 step by step toward your customization? For example, test "nvinfer + fakesink" first, then "nvinfer + nvtracker + fakesink", then the application with the other elements.
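
A minimal sketch of swapping in a fakesink for this kind of test (property names as in the Python samples; sync=0 avoids throttling the sink to the pipeline clock):

    sink = Gst.ElementFactory.make("fakesink", "fakesink")
    sink.set_property("sync", 0)                  # do not wait on the pipeline clock
    sink.set_property("enable-last-sample", 0)    # skip keeping a reference to the last buffer
    pipeline.add(sink)
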
  1. batch-size for nvstreammux and nvinfer are the same, set to 3, and I am using 3 RTSP stream sources.

  2. The performance logger shows no difference between fakesink and nv3dsink, which suggests to me that the bottleneck is not the sink (for now).

  3. I created a new pipeline with the same elements as deepstream-test3 plus an additional tracker element, and linked them exactly the same way. However, I still get no more than 13.4 fps per stream with 3 sources at 1920x1080@25fps. I have removed the OSD buffer probe, just in case it was interfering. This is the script now:
    flavionew.py.txt (22.0 KB)

There is too much custom code in flavionew.py.txt, which makes it hard to directly find the root cause of the low fps issue. Here is an "nvinfer + fakesink" pipeline based on deepstream_test_3: deepstream_test_3.py (16.7 KB). If it runs with a high fps, you can continue to add the other elements step by step.

I ran deepstream_test_3.py as you advised and finally got the same results as the test 3 script from both flavionew.py and deepstream_test_3.py. For both scripts I change the prediction model only by changing the config file. These are the benchmarks I get:

Using Resnet10 and config file deepstream_app_config.txt (1.1 KB):
deepstream_test_3.py **PERF: {'stream0': 24.96, 'stream1': 24.96, 'stream2': 24.96}
flavionew.py **PERF: {'stream0': 24.97, 'stream1': 24.97, 'stream2': 24.97}

Using Yolov5s and config file config_infer_primary_yoloV5.txt (1.1 KB):
deepstream_test_3.py **PERF: {'stream0': 14.85, 'stream1': 14.85, 'stream2': 14.85}
flavionew.py **PERF: {'stream0': 15.99, 'stream1': 15.99, 'stream2': 15.99}

I know YOLOv5 is much more complex and has many more layers than Resnet10, and thus the former is slower than the latter. However, as far as I know, my hardware can easily sustain ~24 fps for YOLO (https://github.com/NVIDIA-AI-IOT/jetson_benchmarks#for-jetson-xavier-nx). So I am back to the original question of this post: how to fix this performance drop. Where can I find NVIDIA's approach for that?

About the yolov5s test: why do you need to set network-mode=0? fp32 precision has worse performance than fp16 or int8. Please refer to this configuration.
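
For reference, network-mode in the Gst-nvinfer configuration file selects the inference precision (0 = FP32, 1 = INT8, 2 = FP16), so switching to fp16 would be a one-line change:

    [property]
    # network-mode: 0=FP32, 1=INT8, 2=FP16
    network-mode=2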

In fact, fp16 does increase the frame rate, in my experience by about 1.9x to 2.2x. And int8 also increases it, but it requires a calibration table, which I don't have.

I use fp32 for benchmarking purposes, since it is the baseline for comparing model performance across different hardware, so I can't change network-mode. Otherwise I would be comparing different things.

What I see in jtop is the model running on the GPU at 68% usage, so there is still headroom to fill the processing pipeline with more work. I guess something still needs to be done to improve the fps.
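
For what it's worth, the jtop reading can be cross-checked with tegrastats (the standard L4T monitoring tool; the GR3D_FREQ field is the GPU utilization):

    sudo tegrastats    # watch the GR3D_FREQ column for GPU load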

Noticing you are using RTSP sources at 25 fps, the max fps of the pipeline should be close to 25. Please test fp16 precision, and please refer to this link for how to create the int8 calibration file.

I am sure fp16 will improve performance, but I can't change network-mode to 2. Switching to fp16, or even int8, just masks the problem; the GPU must perform at fp32.

Note that the GPU is only being used at 68%, so there is plenty of idle capacity that should be usable. How can we identify the element that is holding the pipeline back?

Please refer to this topic. If you are testing two streams with fp32 precision, please share the log of "trtexec --loadEngine=saved.engine --fp16".

So, this is the log for ubuntu@ubuntu:~/EdgeServer$ /usr/src/tensorrt/bin/trtexec --loadEngine=/home/ubuntu/EdgeServer/model_b4_gpu0_fp32.engine --fp16:

log.txt (9.0 KB)

Thanks for sharing! From the log, the theoretical max fps of inference is 54, so if using two streams, the theoretical max fps of each stream should be about 25. You can use "src -> pgie -> fakesink" to verify; I provided the code on Apr 16.

So, running the pipeline with src->streammux->queue->pgie->fakesink (python3 deepstream_test_3.py --silent --no-display -i rtsp://admin:hbyt12345@10.21.45.19 rtsp://admin:hbyt12345@10.21.45.19 rtsp://admin:hbyt12345@10.21.45.19 rtsp://admin:hbyt12345@10.21.45.19)
I get:
****PERF: {'stream0': 15.19, 'stream1': 15.19, 'stream2': 15.19}
****PERF: {'stream0': 15.6, 'stream1': 15.6, 'stream2': 15.6}
****PERF: {'stream0': 15.6, 'stream1': 15.6, 'stream2': 15.6}
****PERF: {'stream0': 15.39, 'stream1': 15.39, 'stream2': 15.39}
****PERF: {'stream0': 15.59, 'stream1': 15.59, 'stream2': 15.59}
****PERF: {'stream0': 15.59, 'stream1': 15.59, 'stream2': 15.59}
****PERF: {'stream0': 15.39, 'stream1': 15.39, 'stream2': 15.39}
****PERF: {'stream0': 15.6, 'stream1': 15.6, 'stream2': 15.6}
****PERF: {'stream0': 15.58, 'stream1': 15.58, 'stream2': 15.58}
****PERF: {'stream0': 15.4, 'stream1': 15.4, 'stream2': 15.4}
****PERF: {'stream0': 15.58, 'stream1': 15.58, 'stream2': 15.58}

Noticing you are testing with four streams: from the logs, the total fps is about 15 x 4 = 60 (per-stream fps x number of streams), which is close to the theoretical max fps of 54.