Performance drop when using multiple sources

Please provide complete information as applicable to your setup.
• Hardware Platform (Jetson / GPU): Jetson Xavier AGX
• DeepStream Version: 6.3.0
• JetPack Version (valid for Jetson only): 5.1
• TensorRT Version: 8.5.2.2

I know the AGX Xavier supports 52x 1080p30 (H.265) decode, and with deepstream-test3-app you can run several sources and hardly see any FPS drop. But in my script there is some kind of bottleneck, or even some synchronous behavior, that is reducing the FPS by almost 20% for every source that is added.

My source is an RTSP stream at 1920x1080@25fps. Using 1 uridecodebin as a source bin, I get 24.2 fps. With 2 source bins I get 18.2 fps, and with 3 source bins I get 13.4 fps. It keeps decreasing as I add more source bins, and it shouldn't, because the AGX Xavier is capable of sustaining 24-25 fps for many inputs like that. So I am missing something.

My pipeline goes like this:

uridecodebin [1..N] -> nvstreammux -> nvinfer -> nvtracker -> nvtee -> nvstreamdemux -> nvvideoconv[1..N] -> nvosd[1..N]

I read some posts about queues and how they can provide asynchronous behavior, so I added a queue between every element, like this:

uridecodebin [1..N] -> nvstreammux -> queue1 -> nvinfer -> queue2 -> nvtracker -> queue3 -> nvtee -> queue4 -> nvstreamdemux -> queue_demux_conv[1..N] -> nvvideoconv[1..N] -> queue_conv_osd[1..N] -> nvosd[1..N]
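
Each queue was created and linked in the usual way; a minimal sketch of one stretch of that chain (element variable names illustrative, following the deepstream_test_3.py style):

    queue1 = Gst.ElementFactory.make("queue", "queue1")
    queue2 = Gst.ElementFactory.make("queue", "queue2")
    pipeline.add(queue1)
    pipeline.add(queue2)

    # streammux -> queue1 -> pgie -> queue2 -> tracker
    streammux.link(queue1)
    queue1.link(pgie)
    pgie.link(queue2)
    queue2.link(tracker)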

If the missing queues were the issue, adding them should have helped, but it made no difference to the decaying performance, so there must be something else going on. I therefore removed the queues again and am back to the original pipeline. Can anyone give me a tip about what's wrong? This is my script:
flavio_forum.py.txt (22.5 KB)

Thanks in advance.

Why do you comment out pgie.set_property("batch-size", number_sources)? If you are using multiple sources, the engine batch-size should be updated accordingly.
Please refer to deepstream_test_3.py, which has a similar media pipeline.
Please refer to this topic for performance improvement.

batch-size=1 because of the error "Backend has maxBatchSize 1 whereas 3 has been requested". I am using YOLOv5: I downloaded yolov5s.pt and converted it to .onnx, and that ONNX file is the one being loaded.

If pgie.set_property("batch-size", number_sources) is commented out, nvinfer uses a fixed batch-size 1 engine, which is not reasonable. With multiple sources, nvinfer needs a batch-size > 1 engine for higher performance.
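
In deepstream_test_3.py this is done by setting the same batch size on both nvstreammux and nvinfer; a minimal sketch of that part of the sample (variable names as used there):

    streammux.set_property("batch-size", number_sources)

    # override the batch-size from the config file if it does not match the source count
    pgie_batch_size = pgie.get_property("batch-size")
    if pgie_batch_size != number_sources:
        pgie.set_property("batch-size", number_sources)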

This is clear to me, but my point is the error ("Backend has maxBatchSize 1 whereas 3 has been requested"). The export from .pt to .onnx was done with --dynamic, which means the model should support a dynamic batch size. So when I set batch-size=3, for instance, the pipeline still gives me that error, even though the model should support it.
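
The export was done roughly like this (assuming the standard export.py from the ultralytics/yolov5 repository; exact flags may vary between versions):

    python3 export.py --weights yolov5s.pt --include onnx --dynamic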

Could you share a whole log and the nvinfer configuration file?

I managed to export the .onnx with dynamic axes, and batch-size in the config can now be set to any number >= 1 with no errors. So I am back to the original problem, the performance drop. Using 3 RTSP streams at 1920x1080@25fps, with pgie and streammux batch-size=3, I still get 16 fps per stream. It should be 24-25 fps, and the value drops further whenever I increase the number of sources.
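
The [property] section of the nvinfer config now looks roughly like this (only the relevant keys, paths simplified; the full file is attached below):

    [property]
    onnx-file=yolov5s.onnx
    batch-size=3
    network-mode=0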

This is the config
config_infer_primary_yoloV5.txt (984 Bytes)

And this is the log
run.log.tar.gz (45.7 MB)

  1. Please refer to the code of deepstream-test3, and make sure the batch-size of nvstreammux and nvinfer is the same as number_sources.
  2. If you want high fps, you can use fakesink (see the sketch after this list).
  3. Noticing that with deepstream-test3 you can get high fps, can you modify deepstream-test3 step by step toward your customization? For example, test "nvinfer + fakesink" first, then "nvinfer + nvtracker + fakesink", then the application with the other elements.
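
A minimal sketch of swapping in a fakesink for this kind of test (property names as in the Python samples; sync=0 avoids throttling the sink to the pipeline clock):

    sink = Gst.ElementFactory.make("fakesink", "fakesink")
    sink.set_property("sync", 0)                  # do not wait on the pipeline clock
    sink.set_property("enable-last-sample", 0)    # skip keeping a reference to the last buffer
    pipeline.add(sink)
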
  1. batch-size for nvstreammux and nvinfer are the same, set to 3, and I am using 3 RTSP stream sources.

  2. The performance logger shows no difference between fakesink and nv3dsink, which suggests to me that the bottleneck is not the sink (for now).

  3. I created a new pipeline with the same elements as deepstream-test3 plus an additional tracker element, and linked them exactly the same way. However, I still get no more than 13.4 fps per stream with 3 sources at 1920x1080@25fps. I have removed the OSD buffer probe, just in case it was interfering. This is the script now:
    flavionew.py.txt (22.0 KB)

There is too much custom code in flavionew.py.txt, which makes it hard to directly find the root cause of the low fps issue. Here is an "nvinfer + fakesink" pipeline based on deepstream_test_3: deepstream_test_3.py (16.7 KB). If it runs with a high fps, you can continue to add the other elements step by step.

I ran deepstream_test_3.py as you advised and finally got the same results as the test 3 script from both flavionew.py and deepstream_test_3.py. For both scripts I change the prediction model only by changing the config file. These are the benchmarks I get:

Using Resnet10 and config file deepstream_app_config.txt (1.1 KB):
deepstream_test_3.py **PERF: {'stream0': 24.96, 'stream1': 24.96, 'stream2': 24.96}
flavionew.py **PERF: {'stream0': 24.97, 'stream1': 24.97, 'stream2': 24.97}

Using Yolov5s and config file config_infer_primary_yoloV5.txt (1.1 KB):
deepstream_test_3.py **PERF: {'stream0': 14.85, 'stream1': 14.85, 'stream2': 14.85}
flavionew.py **PERF: {'stream0': 15.99, 'stream1': 15.99, 'stream2': 15.99}

I know YOLOv5 is much more complex and has many more layers than Resnet10, and thus the former is slower than the latter. However, as far as I know, my hardware can easily sustain ~24 fps for YOLO (https://github.com/NVIDIA-AI-IOT/jetson_benchmarks#for-jetson-xavier-nx). So I am back to the original question of this post: how to fix this performance drop. Where can I find NVIDIA's approach for that?

About the yolov5s test: why do you need to set network-mode=0? fp32 precision has worse performance than fp16 or int8. Please refer to this configuration.
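
For reference, network-mode in the Gst-nvinfer configuration file selects the inference precision (0 = FP32, 1 = INT8, 2 = FP16), so switching to fp16 would be a one-line change:

    [property]
    # network-mode: 0=FP32, 1=INT8, 2=FP16
    network-mode=2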

In fact, fp16 does increase the frame rate, in my experience by about 1.9x to 2.2x. And int8 also increases it, but it requires a calibration table, which I don't have.

I use fp32 for benchmarking purposes, since it is the baseline for comparing model performance across different hardware, so I can't change network-mode. Otherwise I would be comparing different things.

What I see in jtop is the model running on the GPU at 68% usage, so there is still headroom to fill the processing pipeline with more work. I guess something still needs to be done to improve the fps.
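
For what it's worth, the jtop reading can be cross-checked with tegrastats (the standard L4T monitoring tool; the GR3D_FREQ field is the GPU utilization):

    sudo tegrastats    # watch the GR3D_FREQ column for GPU load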

Noticing you are using RTSP sources at 25 fps, the max fps of the pipeline should be close to 25. Please test fp16 precision, and please refer to this link for how to create the int8 calibration file.

I am sure fp16 will improve performance, but I can't change network-mode to 2. Switching to fp16, or even int8, just masks the problem; the GPU must perform at fp32.

Note that the GPU is only being used at 68%, so there is plenty of idle capacity that should be usable. How can we identify the element that is holding the pipeline back?

Please refer to this topic. If you are testing two streams with fp32 precision, please share the log of "trtexec --loadEngine=saved.engine --fp16".

So, this is the log for ubuntu@ubuntu:~/EdgeServer$ /usr/src/tensorrt/bin/trtexec --loadEngine=/home/ubuntu/EdgeServer/model_b4_gpu0_fp32.engine --fp16:

log.txt (9.0 KB)

Thanks for sharing! From the log, the theoretical max fps of inference is 54, so if using two streams, the theoretical max fps of each stream should be about 25. You can use "src -> pgie -> fakesink" to verify; I provided the code on Apr 16.

So, running the pipeline with src->streammux->queue->pgie->fakesink (python3 deepstream_test_3.py --silent --no-display -i rtsp://admin:hbyt12345@10.21.45.19 rtsp://admin:hbyt12345@10.21.45.19 rtsp://admin:hbyt12345@10.21.45.19 rtsp://admin:hbyt12345@10.21.45.19)
I get:
****PERF: {'stream0': 15.19, 'stream1': 15.19, 'stream2': 15.19}
****PERF: {'stream0': 15.6, 'stream1': 15.6, 'stream2': 15.6}
****PERF: {'stream0': 15.6, 'stream1': 15.6, 'stream2': 15.6}
****PERF: {'stream0': 15.39, 'stream1': 15.39, 'stream2': 15.39}
****PERF: {'stream0': 15.59, 'stream1': 15.59, 'stream2': 15.59}
****PERF: {'stream0': 15.59, 'stream1': 15.59, 'stream2': 15.59}
****PERF: {'stream0': 15.39, 'stream1': 15.39, 'stream2': 15.39}
****PERF: {'stream0': 15.6, 'stream1': 15.6, 'stream2': 15.6}
****PERF: {'stream0': 15.58, 'stream1': 15.58, 'stream2': 15.58}
****PERF: {'stream0': 15.4, 'stream1': 15.4, 'stream2': 15.4}
****PERF: {'stream0': 15.58, 'stream1': 15.58, 'stream2': 15.58}

Noticing you are testing with four streams: from the logs, the total fps is about 15 x 4 = 60 (per-stream fps x number of streams), which is close to the theoretical max fps of 54.