Batch-size has marginal effect on multi-source performance

• Hardware Platform (Jetson / GPU) - Jetson TX2
• DeepStream Version - 5.0
• JetPack Version (valid for Jetson only) - 4.4, L4T 32.4.3
• TensorRT Version - 7.1.3
• Issue Type( questions, new requirements, bugs) - question

Hello,

I am currently benchmarking the multi-source performance of our DeepStream 5.0 application on a Jetson TX2 device. Per the DeepStream best practices, we should set the batch-size parameter equal to the number of input sources to increase performance. We’re using the base pruned trafficcamnet model with RTSP inputs (20 fps), setting the batch-size parameter in both the [streammux] and [primary-gie] groups, and seeing the following results:

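| Batch size | # of streams | Average FPS (0) | Average FPS (1) | Average FPS (2) | Average FPS (3) |
|---|---|---|---|---|---|
| 1 | 1 | 19.97 | - | - | - |
| 1 | 2 | 17.19 | 17.17 | - | - |
| 1 | 3 | 11.13 | 11.09 | 11.04 | - |
| 1 | 4 | 8.19 | 8.13 | 8.05 | 8.01 |
| 2 | 1 | 19.96 | - | - | - |
| 2 | 2 | 18.95 | 18.95 | - | - |
| 2 | 3 | 12.47 | 12.44 | 12.39 | - |
| 2 | 4 | 9.2 | 9.15 | 9.09 | 9.07 |
| 3 | 1 | 19.95 | - | - | - |
| 3 | 2 | 17.34 | 17.27 | - | - |
| 3 | 3 | 12.96 | 12.96 | 12.96 | - |
| 3 | 4 | 9.64 | 9.62 | 9.55 | 9.54 |
| 4 | 1 | 19.57 | - | - | - |
| 4 | 2 | 17.74 | 17.74 | - | - |
| 4 | 3 | 12.07 | 12.06 | 12.06 | - |
| 4 | 4 | 9.76 | 9.76 | 9.76 | 9.76 |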

We are seeing some performance increase when matching the batch size to the number of input streams, but this increase is not very significant. This is running inside of a Docker container with all sinks set to fakesink.

My question is: is this marginal performance increase expected, or am I missing something and not taking full advantage of batching?

What’s the streammux batch size in your table? Is it the same as the pgie batch size?

Also, can you test the model FPS via trtexec?

Hi @bcao , thanks for the response!

Yes, the batch size column in the table above represents the batch size set in both the streammux and the pgie groups.
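
For a sweep like the one in the table, the value has to change in both groups on every run. A minimal sketch of one way to do that with Python's standard configparser follows; `BASE_CONFIG` is a hypothetical two-key stand-in for a real deepstream-app config (which has many more groups and keys), and `with_batch_size` is an illustrative helper, not part of DeepStream.

```python
import configparser
import io

# Hypothetical minimal stand-in for a deepstream-app config file.
# Only the two groups discussed in this thread are shown.
BASE_CONFIG = """
[streammux]
batch-size=1

[primary-gie]
batch-size=1
"""


def with_batch_size(config_text: str, batch_size: int) -> str:
    """Return config_text with batch-size set in [streammux] and [primary-gie]."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(config_text)
    for group in ("streammux", "primary-gie"):
        parser[group]["batch-size"] = str(batch_size)
    out = io.StringIO()
    # Write "key=value" without spaces, matching the input style above.
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Example: config text for a 4-stream run.
print(with_batch_size(BASE_CONFIG, 4))
```

Generating the text from one base config keeps the two values from drifting apart between runs.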

Regarding testing the model using trtexec, I’m not seeing any option for tlt models (as I am using trafficcamnet) and per this post, trtexec doesn’t support tlt/etlt models.

Are there any other tools that we could use to benchmark the model outside of our application?
What is the expected performance increase when adjusting batch-size to match the number of input sources? Can you confirm whether or not the small performance increase we’re seeing is expected, or if there should be a greater difference?

I rechecked your table. The results look expected, since your source stream is 20 fps. Which part do you think is unexpected?

Hi @bcao,

We thought that increasing the batch-size (specifically to match the number of streams) would result in a greater performance increase than 1-2 FPS. I do see that the per-stream FPS for each number of streams in my table above (1-4) is greatest when batch size == number of streams, but I’d expect batching to have a greater effect on performance.

However, if you think that 1-2 FPS is the expected performance increase when changing batch-size, then I can explore other ways to increase the multi-stream performance of our DeepStream app.
Do you have any recommendations to increase pipeline FPS beyond what’s explicitly mentioned in the DeepStream best practices?

  1. Please move to the latest version of DS.
  2. Increasing the batch size will probably not help, because the GPU is likely already saturated; you can check this with the tegrastats utility. Try enabling the max clock setting. Performance also depends on the model. Check the perf numbers we have reported for trafficcamnet: https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_Performance.html
  3. Also, can you please share your config files for us to check?
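
One way to check the table against point 2 is to sum the per-stream averages into a total pipeline throughput for each run. The sketch below does only that arithmetic on the numbers from the original post; the names `RESULTS` and `aggregate_fps` are illustrative.

```python
# Per-stream average FPS copied from the table in the original post,
# keyed by (batch_size, number_of_streams).
RESULTS = {
    (1, 1): [19.97],
    (1, 2): [17.19, 17.17],
    (1, 3): [11.13, 11.09, 11.04],
    (1, 4): [8.19, 8.13, 8.05, 8.01],
    (2, 1): [19.96],
    (2, 2): [18.95, 18.95],
    (2, 3): [12.47, 12.44, 12.39],
    (2, 4): [9.2, 9.15, 9.09, 9.07],
    (3, 1): [19.95],
    (3, 2): [17.34, 17.27],
    (3, 3): [12.96, 12.96, 12.96],
    (3, 4): [9.64, 9.62, 9.55, 9.54],
    (4, 1): [19.57],
    (4, 2): [17.74, 17.74],
    (4, 3): [12.07, 12.06, 12.06],
    (4, 4): [9.76, 9.76, 9.76, 9.76],
}


def aggregate_fps(results):
    """Sum per-stream FPS to get total frames per second through the pipeline."""
    return {key: round(sum(fps), 2) for key, fps in results.items()}


TOTALS = aggregate_fps(RESULTS)

for (batch, streams), total in sorted(TOTALS.items()):
    print(f"batch-size={batch} streams={streams} total={total:.2f} FPS")
```

With these numbers, the total for two or more streams stays between roughly 32 and 39 FPS at every batch size, while the single-stream runs sit near the 20 fps source rate. A total that stops growing as streams are added is consistent with the saturation described in point 2, although the table alone cannot show which component is the bottleneck.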