Speedup by increasing # of streams vs. batch size


I experimented with the speedup from increasing the number of CUDA streams versus increasing the batch size.
I expected a meaningful speedup in both cases, but increasing the number of streams gives no significant speedup.
Multi-stream is still faster than sequential processing, since it hides the memory transfer time of the input images through pipelining.
Do you think this result is normal?
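For reference, the pipelining I mean is the usual pattern of issuing asynchronous copies and inference on per-image CUDA streams so transfers for one image overlap compute for another. This is only an illustrative sketch (buffer names are hypothetical, and `runInference` stands in for the actual TensorRT enqueue call); it needs a CUDA-capable GPU to run:

```cpp
#include <cuda_runtime.h>

// Illustrative only: overlap the H2D copy of image i+1 with inference on
// image i by round-robining work across nStreams CUDA streams.
// d_in[s] is a per-stream device input buffer; h_in[i] is pinned host memory.
void pipeline(void** d_in, const void** h_in, size_t bytes,
              cudaStream_t* streams, int nStreams, int nImages) {
    for (int i = 0; i < nImages; ++i) {
        int s = i % nStreams;
        // Async copy returns immediately; the copy engine can run while
        // inference for an earlier image executes on another stream.
        cudaMemcpyAsync(d_in[s], h_in[i], bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        // runInference(s, streams[s]);  // enqueue on the same stream
    }
    // Wait for all in-flight work before reading results.
    for (int s = 0; s < nStreams; ++s)
        cudaStreamSynchronize(streams[s]);
}
```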


TensorRT Version:
GPU Type: T4
Nvidia Driver Version: 470.63.01
CUDA Version: 10.2
CUDNN Version:
Operating System + Version: Ubuntu 18.04.6 LTS

Image Size: 960x604
Network Model: SSD


The links below might be useful for you.


For multi-threading/streaming, we suggest you use DeepStream or Triton.

For DeepStream details, we recommend raising the query in the DeepStream forum.

For Triton, please raise the query in the Triton Inference Server GitHub issues section.


I tried it again with a separate execution context (nvinfer1::IExecutionContext) for each stream.
But the execution time shows a similar pattern (no significant speedup as the number of streams increases).
Do I need to create anything else separately?
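To be concrete, this is roughly the per-stream setup I tried — a separate context and stream per worker, with all copies and the enqueue issued on that worker's stream. The struct and buffer names are hypothetical, and this assumes a deserialized `nvinfer1::ICudaEngine*` plus per-stream device buffers already allocated:

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical sketch: one IExecutionContext and one cudaStream_t per worker.
struct StreamWorker {
    nvinfer1::IExecutionContext* ctx;
    cudaStream_t stream;
    void* bindings[2];  // {input, output} device pointers, set by the caller
};

std::vector<StreamWorker> makeWorkers(nvinfer1::ICudaEngine* engine, int n) {
    std::vector<StreamWorker> workers(n);
    for (int i = 0; i < n; ++i) {
        // Separate execution context per stream, as described above.
        workers[i].ctx = engine->createExecutionContext();
        cudaStreamCreate(&workers[i].stream);
    }
    return workers;
}

// Per-image loop: H2D copy, enqueue, D2H copy -- all async on the worker's
// stream so work from different workers can overlap.
void infer(StreamWorker& w, const void* h_in, void* h_out,
           size_t inBytes, size_t outBytes) {
    cudaMemcpyAsync(w.bindings[0], h_in, inBytes,
                    cudaMemcpyHostToDevice, w.stream);
    w.ctx->enqueueV2(w.bindings, w.stream, nullptr);
    cudaMemcpyAsync(h_out, w.bindings[1], outBytes,
                    cudaMemcpyDeviceToHost, w.stream);
    // The caller synchronizes w.stream (or records an event) before reading h_out.
}
```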