Batching vs CUDA Streams for concurrent inferences?

For a video analytics case where hundreds of video feeds arrive and inference needs to be run on all of them, is it better to batch the frames from the video feeds, OR to create hundreds of execution context instances with hundreds of CUDA streams to achieve the most effective concurrency? Which one will be faster? It would be great if someone could explain this to me.


Hi,
It is a trade-off. For non-real-time usage, where you only care about getting through all the frames (overall throughput), I would recommend using a relatively large batch size.
But if you care about the latency of each individual frame, multi-stream might be the better way.
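
For example, here is a minimal sketch of the multi-stream option, assuming a TensorRT 7/8-era C++ API, an already-deserialized engine, and per-context device buffers allocated elsewhere; the number of streams and all names are illustrative only:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

constexpr int kNumStreams = 4;  // illustrative; tune for your GPU and model size

// One execution context + one CUDA stream per concurrent inference, all
// sharing the same engine. The per-context binding arrays (device input
// and output pointers) are assumed to be allocated elsewhere.
void runMultiStream(nvinfer1::ICudaEngine* engine, const std::vector<void**>& bindings)
{
    std::vector<nvinfer1::IExecutionContext*> contexts(kNumStreams);
    std::vector<cudaStream_t> streams(kNumStreams);

    for (int i = 0; i < kNumStreams; ++i)
    {
        contexts[i] = engine->createExecutionContext();
        cudaStreamCreate(&streams[i]);
    }

    // Enqueue everything without synchronizing in between, so the GPU can
    // overlap the inferences if a single one does not saturate it.
    for (int i = 0; i < kNumStreams; ++i)
    {
        contexts[i]->enqueueV2(bindings[i], streams[i], nullptr);
    }

    // Wait for all streams only at the end, then clean up.
    for (int i = 0; i < kNumStreams; ++i)
    {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        contexts[i]->destroy();
    }
}
```

The large-batch alternative is essentially the same code with a single context and stream, and the batch size baked into the engine's input dimensions instead.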

Please refer to the link below for best practices:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html

For working with video streams, I would recommend using the DeepStream SDK:

Thanks

First of all, thank you so much for your valuable response. My case is real-time video analytics, where we need both high throughput and low latency. Isn’t it a good idea to combine multiple CUDA streams and batching together? (What I mean by a batch here is the collection of frames we get from the cameras in real time at a given point in time.)
For example, if 128 cameras are connected, would it be better to have 4 CUDA streams, each running inference on the frames from 32 cameras as a batch (or whatever batch size does not interrupt real-time output generation)? Or would having 128 separate CUDA streams, each dedicated to one camera with no batching at all, give us better overall throughput?
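
Roughly what I have in mind, as a rough sketch only (assuming an engine built with an explicit batch dimension of 32, one input and one output binding, and pinned host buffers already filled by preprocessing; all names here are just illustrative):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

constexpr int kGroups       = 4;   // 4 CUDA streams
constexpr int kCamsPerGroup = 32;  // 32 cameras batched per stream

// One iteration: for each group, copy the latest batch of 32 frames to the
// GPU and enqueue inference on that group's stream, then wait for all groups.
void runOneIteration(std::vector<nvinfer1::IExecutionContext*>& contexts,
                     std::vector<cudaStream_t>& streams,
                     std::vector<void*>& deviceInput,   // per-group device input buffer
                     std::vector<void*>& deviceOutput,  // per-group device output buffer
                     std::vector<void*>& hostInput,     // per-group pinned host input (32 frames)
                     size_t batchBytes)                 // size of one 32-frame batch in bytes
{
    for (int g = 0; g < kGroups; ++g)
    {
        // H2D copy and inference for different groups run on different
        // streams, so they can overlap on the GPU.
        cudaMemcpyAsync(deviceInput[g], hostInput[g], batchBytes,
                        cudaMemcpyHostToDevice, streams[g]);
        void* bindings[] = {deviceInput[g], deviceOutput[g]};
        contexts[g]->enqueueV2(bindings, streams[g], nullptr);
    }

    // Per-frame latency is bounded by the slowest of the 4 batches.
    for (int g = 0; g < kGroups; ++g)
    {
        cudaStreamSynchronize(streams[g]);
    }
}
```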

Hi,
Per-frame latency and overall performance are usually a trade-off, so you can try both approaches and see which works better in your case.
Multi-stream can help if a single stream does not use the full GPU compute capacity. But increasing the batch size to fully use the GPU resources might give better throughput.
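
For example, a small timing harness like the sketch below can be used to compare the variants empirically; it assumes each call to step runs one full iteration, processes framesPerStep frames, and synchronizes its CUDA streams before returning (all names are illustrative):

```cpp
#include <chrono>
#include <cstdio>
#include <functional>

// Measure throughput and mean step latency of one configuration.
// `step` should run one full iteration (enqueue + stream synchronization).
void benchmark(const std::function<void()>& step, int framesPerStep, int iterations)
{
    // Warm-up so lazy allocations and autotuning do not skew the numbers.
    for (int i = 0; i < 10; ++i) step();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) step();
    auto end = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(end - start).count();
    std::printf("throughput: %.1f frames/s, mean step latency: %.2f ms\n",
                static_cast<double>(framesPerStep) * iterations / seconds,
                seconds * 1000.0 / iterations);
}
```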

Thanks

One last thing I would like to know: if we use DeepStream, can we make multiple CNNs perform inference in parallel on a camera stream, or does it run the inferences in series, one after the other? In other words, will DeepStream utilize the GPU to the fullest and use the best way to run lots of inference in real time, or is there any advantage in designing the inference application ourselves, directly with TensorRT?
Thank you.

Hi,
DeepStream has the capability to run multiple streams concurrently.
The number of streams that can run concurrently depends on the size of the model versus the available compute.

Please refer to the link below for more details:

Thanks

Thank you so much for your clarifications.