Batching vs CUDA Streams for concurrent inferences?

For a video analytics case where hundreds of video feeds arrive and inference needs to be run on all of them, is it better to batch the frames from the video feeds, OR to create hundreds of execution context instances with hundreds of CUDA streams to achieve the most effective concurrency? Which one will be faster? It would be great if someone could explain this to me.


Hi,
It is a trade-off. For non-real-time usage, where you only care about getting through all the frames (overall throughput), I would recommend using a relatively large batch size.
But if you care about the latency of each individual frame, multi-stream might be the better way.
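
For example, here is a minimal sketch of the multi-stream option, assuming a TensorRT 7/8-era C++ API, an already-deserialized engine, and per-context device buffers allocated elsewhere; the number of streams and all names are illustrative only:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

constexpr int kNumStreams = 4;  // illustrative; tune for your GPU and model size

// One execution context + one CUDA stream per concurrent inference, all
// sharing the same engine. The per-context binding arrays (device input
// and output pointers) are assumed to be allocated elsewhere.
void runMultiStream(nvinfer1::ICudaEngine* engine, const std::vector<void**>& bindings)
{
    std::vector<nvinfer1::IExecutionContext*> contexts(kNumStreams);
    std::vector<cudaStream_t> streams(kNumStreams);

    for (int i = 0; i < kNumStreams; ++i)
    {
        contexts[i] = engine->createExecutionContext();
        cudaStreamCreate(&streams[i]);
    }

    // Enqueue everything without synchronizing in between, so the GPU can
    // overlap the inferences if a single one does not saturate it.
    for (int i = 0; i < kNumStreams; ++i)
    {
        contexts[i]->enqueueV2(bindings[i], streams[i], nullptr);
    }

    // Wait for all streams only at the end, then clean up.
    for (int i = 0; i < kNumStreams; ++i)
    {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        contexts[i]->destroy();
    }
}
```

The large-batch alternative is essentially the same code with a single context and stream, and the batch size baked into the engine's input dimensions instead.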

Please refer to the link below for best practices:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html

For working with video streams, I would recommend using the DeepStream SDK:

Thanks

First of all, thank you so much for your valuable response. My case is real-time video analytics, where we need both high throughput and low latency. Isn’t it a good idea to combine multiple CUDA streams and batching together? (What I mean by a batch here is the collection of frames we get from the cameras in real time at a given point in time.)
For example, if 128 cameras are connected, would it be better to have 4 CUDA streams, each running inference on the frames from 32 cameras as a batch (or whatever batch size does not interrupt real-time output generation)? Or would having 128 separate CUDA streams, each dedicated to one camera with no batching at all, give us better overall throughput?
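
Roughly what I have in mind, as a rough sketch only (assuming an engine built with an explicit batch dimension of 32, one input and one output binding, and pinned host buffers already filled by preprocessing; all names here are just illustrative):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

constexpr int kGroups       = 4;   // 4 CUDA streams
constexpr int kCamsPerGroup = 32;  // 32 cameras batched per stream

// One iteration: for each group, copy the latest batch of 32 frames to the
// GPU and enqueue inference on that group's stream, then wait for all groups.
void runOneIteration(std::vector<nvinfer1::IExecutionContext*>& contexts,
                     std::vector<cudaStream_t>& streams,
                     std::vector<void*>& deviceInput,   // per-group device input buffer
                     std::vector<void*>& deviceOutput,  // per-group device output buffer
                     std::vector<void*>& hostInput,     // per-group pinned host input (32 frames)
                     size_t batchBytes)                 // size of one 32-frame batch in bytes
{
    for (int g = 0; g < kGroups; ++g)
    {
        // H2D copy and inference for different groups run on different
        // streams, so they can overlap on the GPU.
        cudaMemcpyAsync(deviceInput[g], hostInput[g], batchBytes,
                        cudaMemcpyHostToDevice, streams[g]);
        void* bindings[] = {deviceInput[g], deviceOutput[g]};
        contexts[g]->enqueueV2(bindings, streams[g], nullptr);
    }

    // Per-frame latency is bounded by the slowest of the 4 batches.
    for (int g = 0; g < kGroups; ++g)
    {
        cudaStreamSynchronize(streams[g]);
    }
}
```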

Hi,
Per-frame latency and overall performance are usually a trade-off, so you can try both approaches and see which works better in your case.
Multi-stream can help if a single stream does not use the full GPU compute capacity. But increasing the batch size to fully use the GPU resources might give better throughput.
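
For example, a small timing harness like the sketch below can be used to compare the variants empirically; it assumes each call to step runs one full iteration, processes framesPerStep frames, and synchronizes its CUDA streams before returning (all names are illustrative):

```cpp
#include <chrono>
#include <cstdio>
#include <functional>

// Measure throughput and mean step latency of one configuration.
// `step` should run one full iteration (enqueue + stream synchronization).
void benchmark(const std::function<void()>& step, int framesPerStep, int iterations)
{
    // Warm-up so lazy allocations and autotuning do not skew the numbers.
    for (int i = 0; i < 10; ++i) step();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) step();
    auto end = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(end - start).count();
    std::printf("throughput: %.1f frames/s, mean step latency: %.2f ms\n",
                static_cast<double>(framesPerStep) * iterations / seconds,
                seconds * 1000.0 / iterations);
}
```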

Thanks

One last thing I would like to know: if we use DeepStream, can we make multiple CNNs perform inference in parallel on a camera stream, or does it run the inferences in series, one after the other? In other words, will DeepStream utilize the GPU to the fullest and use the best way to run lots of inference in real time, or is there any advantage in designing the inference application ourselves, directly with TensorRT?
Thank you.

Hi,
DeepStream has the capability to run multiple streams concurrently.
The number of streams that can run concurrently depends on the size of the model versus the available compute.

Please refer to the link below for more details:

Thanks

Thank you so much for your clarifications.