Performance drop when using multiple sources


When I run with the config file that loads resnet10, I get 12 streams at 25 fps, which makes 12 x 25 = 300 fps > 54 fps (and I think I can still increase this number). So how is this theoretical max fps computed?

About resnet10, are you also using fp32 precision? Different models have different theoretical max fps, so you need to test resnet10's theoretical max fps separately. Please refer to the comment on Apr 24.

Yes, it was also fp32. But my point is how this theoretical max fps is computed. It depends on the GPU model, the prediction model, and many other things. How do I compute this theoretical max fps?

Please refer to my comments on April 24, 25, and 26. Testing the engine with trtexec gives a theoretical max fps (the value of Throughput:). This test covers only the inference part and does not include video decoding, OSD, or other processing. A DeepStream pipeline includes video decoding, nvstreammux, inference, OSD, and other processing, so it is more complex; the max fps of the pipeline is close to the theoretical max fps obtained from the trtexec test.

When running /usr/src/tensorrt/bin/trtexec --loadEngine=/home/ubuntu/EdgeServer/model_b4_gpu0_fp32.engine --fp16, sometimes I get Throughput: 54.7398 qps, sometimes Throughput: 43.9396 qps. Why such a variance? Is it possible to narrow it?

trtexec uses random input data. Please make sure no other applications are using the GPU. You can run it many times and take the average. Here is my test log.txt (1.8 KB). The values differ between runs but fluctuate around 1030.
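For anyone who wants to automate the "run many times, then average" advice, here is a minimal Python sketch. It assumes you have saved the console output of several trtexec runs into one text blob and that each run prints a line of the form `Throughput: <value> qps`, as quoted earlier in this thread. The function names and the sample text are hypothetical, and the two sample values are simply the ones reported above.

```python
import re
import statistics

# Matches lines such as "Throughput: 54.7398 qps" (format assumed from this thread).
THROUGHPUT_RE = re.compile(r"Throughput:\s*([0-9]+(?:\.[0-9]+)?)\s*qps")


def parse_throughputs(log_text):
    """Return every Throughput value (in qps) found in the given log text."""
    return [float(value) for value in THROUGHPUT_RE.findall(log_text)]


def summarize(values):
    """Return (mean, sample standard deviation) of the throughput values."""
    if not values:
        raise ValueError("no Throughput lines found in the log text")
    mean = statistics.mean(values)
    # stdev needs at least two samples; report 0.0 for a single run.
    spread = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, spread


# Hypothetical log text holding the two values reported in this thread.
sample_log = """
Throughput: 54.7398 qps
Throughput: 43.9396 qps
"""

values = parse_throughputs(sample_log)
mean_qps, stdev_qps = summarize(values)
print(f"runs={len(values)} mean={mean_qps:.4f} qps stdev={stdev_qps:.4f} qps")
```

A large standard deviation relative to the mean is a hint that something else may be competing for the GPU, so it is worth repeating the runs on an otherwise idle machine.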

Thanks so much @fanzh for your comments, they are enlightening. To help other developers, I have summarized the solution, which is spread all over this post, but I encourage everybody to read the full post.

  1. Be sure the batch-size in the config file matches the number of sources.
  • It may be useful if your model supports a dynamic batch size instead of a static one.
  • See lines 332-335 for setting it automatically.
  2. Set network-mode to 0 (fp32), 1 (int8) or 2 (fp16) to improve performance. I am using 32-bit float, but you may get much better results if you change to fp16 or int8. It is a trade-off, since you gain performance at the price of reduced precision (a very small reduction, but it does exist).
  3. The theoretical maximum fps of your engine is obtained with trtexec --loadEngine=your_saved_saved.engine --fp16; see Throughput at the end of the console output. It is not really "theoretical", it is rather the "upper limit", but either way it is the maximum value you can get from any pipeline. For instance, if you get Throughput = 60 qps and you intend to have 4 sources, the maximum performance for that configuration is 60/4 = 15 fps per source.
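The arithmetic in the last point can be written as a small Python sketch. It follows the summary's own calculation, which treats the measured Throughput as the total frames per second to be shared evenly across sources; that is an assumption carried over from the example above, not something trtexec reports directly. The function names are hypothetical.

```python
def per_source_fps(throughput_qps, num_sources):
    """Upper limit of fps per source, splitting the throughput evenly."""
    if num_sources <= 0:
        raise ValueError("num_sources must be positive")
    return throughput_qps / num_sources


def required_total_fps(num_sources, fps_per_source):
    """Total fps the inference stage must sustain for the given streams."""
    return num_sources * fps_per_source


def fits_budget(throughput_qps, num_sources, fps_per_source):
    """True if the measured throughput covers all sources at the wanted fps."""
    return required_total_fps(num_sources, fps_per_source) <= throughput_qps


# The example from the summary: 60 qps shared by 4 sources.
print(per_source_fps(60, 4))

# The question at the top of the thread: 12 streams at 25 fps.
print(required_total_fps(12, 25))

# Would 4 sources at 25 fps each fit into a 60 qps budget?
print(fits_budget(60, 4, 25))
```

Keep in mind that this is only the inference budget. As noted earlier in the thread, decoding, nvstreammux, and OSD also take time, so a real pipeline will land somewhat below this limit.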
