• Hardware Platform (Jetson / GPU)
Nvidia Tesla T4 GPU
• DeepStream Version
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type (questions, new requirements, bugs)
I’m using DeepStream 5.1 for batch processing of non-live HLS video. The aim is to achieve the highest possible throughput.
The GStreamer pipeline contains one nvstreammux element for batching and three nvinfer elements: one primary detector and two secondary models that operate on the bounding boxes predicted by the primary model.
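For reference, the topology described above could be sketched as a gst-launch-1.0 command line (the element properties, config file paths, and resolution here are placeholders, not my actual configuration):

```
gst-launch-1.0 \
  uridecodebin uri=<hls-url> ! m.sink_0 \
  nvstreammux name=m batch-size=8 width=1920 height=1080 ! \
  nvinfer config-file-path=pgie_config.txt batch-size=8 ! \
  nvinfer config-file-path=sgie1_config.txt batch-size=256 ! \
  nvinfer config-file-path=sgie2_config.txt batch-size=256 ! \
  fakesink
```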
I’ve tried many different combinations of batch sizes for all three models, but the speed-up was modest. I settled on 8 for the detector and 256 for the secondary models. With this I achieve 184 FPS on my test videos, whereas running the detector on its own achieves 444 FPS.
My current hypothesis is that the main bottleneck lies in how batches are passed between the nvinfer elements. The secondary models process whatever objects the primary model detects in each frame batch; batches are never re-formed to the ideal size. Since the number of detected objects varies from frame to frame, the secondary models will very rarely receive their optimal batch size.
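To illustrate the hypothesis, here is a small back-of-the-envelope simulation (not DeepStream API; the object-count distribution is an assumption): with a primary batch of 8 frames and a secondary batch size of 256, the secondary engine only runs a full batch when those 8 frames together contain at least 256 detected objects.

```python
# Hypothetical illustration of secondary batch under-utilisation.
# Numbers of objects per frame are simulated, not measured.
import random

random.seed(0)

PRIMARY_BATCH = 8      # frames per nvstreammux batch (from the pipeline above)
SECONDARY_BATCH = 256  # nvinfer batch-size of the secondary models

def secondary_occupancy(objects_per_frame):
    """Fraction of the secondary batch actually filled for one primary batch."""
    total_objects = sum(objects_per_frame)
    # The secondary nvinfer runs on the objects available in this batch;
    # it cannot wait for future frames to top the batch up.
    return min(total_objects, SECONDARY_BATCH) / SECONDARY_BATCH

# Simulate 1000 primary batches with 0-20 detected objects per frame.
occupancies = []
for _ in range(1000):
    frames = [random.randint(0, 20) for _ in range(PRIMARY_BATCH)]
    occupancies.append(secondary_occupancy(frames))

avg = sum(occupancies) / len(occupancies)
print(f"average secondary batch occupancy: {avg:.1%}")
```

Under these assumed object counts (about 10 per frame on average) the secondary batch is only around a third full, which would match the observed drop from 444 FPS to 184 FPS being dominated by under-filled secondary batches.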
Is it possible to use the nvstreamdemux element, and perhaps a queue between the nvinfer elements, to decouple them and feed the ideal batch size to the secondary models?
Is there anything else I should consider when optimising batch processing with multiple models in DeepStream?