Speedup by increasing # of streams vs. batch size

Description

I experimented with the speedup from increasing the number of CUDA streams versus increasing the batch size, and I expected a significant speedup in both cases. However, increasing the number of streams gives no significant speedup. Multi-stream execution is faster than sequential processing, and it even hides part of the input-image transfer time by pipelining the copies with compute, but the gain does not grow as more streams are added. Do you think this result is normal?
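
(For reference, by "pipelining" I mean the standard copy/compute overlap across streams with pinned host memory. Below is a minimal, self-contained CUDA sketch of that pattern; the kernel is just a placeholder standing in for the real inference work, and the sizes are illustrative, not my actual code.)

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real inference work.
__global__ void process(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main()
{
    constexpr int kStreams = 4;
    constexpr int kN = 960 * 604 * 3;  // one 960x604 RGB image per stream

    cudaStream_t streams[kStreams];
    float* hIn[kStreams];
    float* dIn[kStreams];
    float* dOut[kStreams];

    for (int s = 0; s < kStreams; ++s)
    {
        cudaStreamCreateWithFlags(&streams[s], cudaStreamNonBlocking);
        cudaMallocHost((void**)&hIn[s], kN * sizeof(float));  // pinned host memory: needed for truly async copies
        cudaMalloc((void**)&dIn[s], kN * sizeof(float));
        cudaMalloc((void**)&dOut[s], kN * sizeof(float));
    }

    // Issued back to back on different streams, stream s+1's host-to-device
    // copy can overlap stream s's kernel, hiding most of the transfer time.
    for (int s = 0; s < kStreams; ++s)
    {
        cudaMemcpyAsync(dIn[s], hIn[s], kN * sizeof(float), cudaMemcpyHostToDevice, streams[s]);
        process<<<(kN + 255) / 256, 256, 0, streams[s]>>>(dIn[s], dOut[s], kN);
    }
    cudaDeviceSynchronize();
    return 0;
}
```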

Environment

TensorRT Version: 8.2.1.8
GPU Type: T4
Nvidia Driver Version: 470.63.01
CUDA Version: 10.2
CUDNN Version: 8.2.4.15
Operating System + Version: Ubuntu 18.04.6 LTS

Image Size: 960x604
Network Model: SSD

Hi,

The links below may be useful for you:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#thread-safety

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html

For multi-threading/streaming, we suggest using DeepStream or Triton.

For more details, we recommend raising the query on the DeepStream forum, or in the issues section of the Triton Inference Server GitHub repository.

Thanks!

@NVES I tried it again with a separate context (nvinfer1::IExecutionContext) for each stream, but the execution time shows a similar pattern: still no significant speedup from increasing the number of streams. Do I need to create anything else separately?
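
For reference, this is roughly the shape of my per-stream setup (a simplified sketch, not the actual code; the binding order, buffer sizes, and error handling are placeholders):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

struct StreamWorker
{
    nvinfer1::IExecutionContext* context;  // one context per stream (contexts are not thread-safe)
    cudaStream_t stream;
    void* bindings[2];                     // [0] = input, [1] = output (order is engine-specific)
};

// Build one worker per stream; the engine itself is shared by all contexts.
std::vector<StreamWorker> makeWorkers(nvinfer1::ICudaEngine* engine, int numStreams,
                                      size_t inBytes, size_t outBytes)
{
    std::vector<StreamWorker> workers(numStreams);
    for (auto& w : workers)
    {
        w.context = engine->createExecutionContext();
        cudaStreamCreateWithFlags(&w.stream, cudaStreamNonBlocking);
        cudaMalloc(&w.bindings[0], inBytes);   // separate device buffers per stream
        cudaMalloc(&w.bindings[1], outBytes);
    }
    return workers;
}

// Launch all workers asynchronously, then wait for them all to finish.
void enqueueAll(std::vector<StreamWorker>& workers)
{
    for (auto& w : workers)
        w.context->enqueueV2(w.bindings, w.stream, nullptr);
    for (auto& w : workers)
        cudaStreamSynchronize(w.stream);
}
```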