• Hardware Platform (Jetson / GPU): RTX 3090 24GB
• DeepStream Version: 5.1
• TensorRT Version: 7.2.3-1+cuda11.1
• NVIDIA GPU Driver Version (valid for GPU only): 460.32.03
• Issue Type( questions, new requirements, bugs)> questions
yolov4-tiny
I tested the performance of the yolov4-tiny model with deepstream-app on my RTX 3090. When I use an engine built with batch-size=4 and 4 input streams, each stream runs at around 210 FPS and GPU utilization is around 30%. So it seems the GPU has plenty of headroom left.
If I build a new yolov4-tiny engine with batch-size=8 and test it with 8 input streams, each stream runs at around 105 FPS (half of the previous case) and GPU utilization is still around 30%.
Next, instead of a larger batch, I ran two deepstream-app instances, each using the batch-size=4 engine from the first case with 4 input streams. Again, each stream ran at around 105 FPS and GPU utilization was still around 30%.
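To put the three yolov4-tiny runs side by side, here is a small Python sketch that multiplies the per-stream FPS by the total number of streams. The helper name `aggregate_fps` and the dictionary layout are only my own illustration (not part of DeepStream), and the inputs are the approximate values quoted above.

```python
def aggregate_fps(apps, streams_per_app, fps_per_stream):
    """Total frames per second summed over every stream of every running app."""
    return apps * streams_per_app * fps_per_stream

# (apps, streams per app, approximate FPS per stream) for the yolov4-tiny runs
tiny_runs = {
    "1 app, batch-size=4": (1, 4, 210),
    "1 app, batch-size=8": (1, 8, 105),
    "2 apps, batch-size=4": (2, 4, 105),
}

tiny_totals = {name: aggregate_fps(*cfg) for name, cfg in tiny_runs.items()}
for name, total in tiny_totals.items():
    print(f"{name}: ~{total} FPS in total")
```

With these numbers, all three configurations work out to the same total of roughly 840 FPS, so for yolov4-tiny the overall throughput stays flat no matter how the streams are split.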
yolov4-mish
I repeated all of this for yolov4-mish (which is larger than yolov4-tiny).
In short:
For batch-size=4, input-streams=4, FPS=~28 for each stream, GPU-util=~45%
For batch-size=8, input-streams=8, FPS=~14 for each stream, GPU-util=~45%
For two deepstream-app instances, batch-size=4, input-streams=4 for each app, FPS=~28 for each stream in both apps (!), GPU-util=~98%
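As a rough check on how the yolov4-mish runs scale, this Python sketch compares the total throughput of each run against the first one. The variable names are hypothetical and the inputs are just the approximate figures from the list above.

```python
# (label, apps, streams per app, approximate FPS per stream) for yolov4-mish
mish_runs = [
    ("1 app, batch-size=4", 1, 4, 28),
    ("1 app, batch-size=8", 1, 8, 14),
    ("2 apps, batch-size=4", 2, 4, 28),
]

# Total FPS of a run = number of apps * streams per app * FPS per stream
mish_totals = [(label, apps * streams * fps) for label, apps, streams, fps in mish_runs]

# Use the first run (single app, batch-size=4) as the baseline
baseline = mish_totals[0][1]
mish_scaling = [(label, total, total / baseline) for label, total in mish_totals]

for label, total, ratio in mish_scaling:
    print(f"{label}: ~{total} FPS in total ({ratio:.1f}x the baseline)")
```

On these figures, the single app with batch-size=8 delivers the same total as the baseline (about 112 FPS), while the two parallel apps deliver about twice that (about 224 FPS).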
How can I make the GPU use all of its available performance? In the case of yolov4-mish, why can two deepstream-app instances with the batch-size=4 engine use ~98% of the GPU, while a single deepstream-app with the batch-size=8 engine cannot?
(Both engines were built with FP16 precision.)