Inference Time When Using Multiple Streams in TensorRT Is Much Slower than with a Single Stream

Description

Running inference with multiple CUDA streams on a single GPU is much slower than using a single stream; the nvprof GPU trace shows the streams executing alternately (serially) rather than in parallel.

Environment

TensorRT Version: 7.2.3
GPU Type: Tesla T4
Nvidia Driver Version: 440.44
CUDA Version: 10.2
CUDNN Version: 8.0
Operating System + Version: CentOS 7
Python Version (if applicable): 3.6
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.6.0
Baremetal or Container (if container which image + tag):

Hi,
When I use multiple streams, each with a single context, the inference speed is much slower than with a single stream. I used nvprof to observe the GPU trace: the streams execute alternately, not in parallel as I expected.
My questions are: are multiple streams executed serially on the GPU, and how can I get the best speed when I have several engines that can run inference at the same time?

Hi,

You may need to create multiple IExecutionContext instances and assign each IExecutionContext its own CUDA stream when calling enqueueV2. The contexts are independent of each other.
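
For illustration, here is a minimal sketch of that setup using the C++ API (TensorRT 7.x). The names `engine` and `bindings` are placeholders: `engine` is assumed to be an already-deserialized `nvinfer1::ICudaEngine*`, and `bindings[i]` holds the device buffer pointers for the i-th context. Error checking and cleanup are omitted.

```cpp
#include <vector>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// Run numStreams inferences concurrently: one IExecutionContext and one
// CUDA stream per request, all sharing the same ICudaEngine.
void runConcurrent(nvinfer1::ICudaEngine* engine,
                   std::vector<std::vector<void*>>& bindings,
                   int numStreams)
{
    std::vector<nvinfer1::IExecutionContext*> contexts(numStreams);
    std::vector<cudaStream_t> streams(numStreams);

    for (int i = 0; i < numStreams; ++i)
    {
        contexts[i] = engine->createExecutionContext();
        cudaStreamCreate(&streams[i]);
    }

    // Enqueue all requests without synchronizing in between; each context
    // uses its own stream, so kernels may overlap if the GPU has spare
    // resources.
    for (int i = 0; i < numStreams; ++i)
    {
        contexts[i]->enqueueV2(bindings[i].data(), streams[i], nullptr);
    }

    // Wait for every stream to finish.
    for (int i = 0; i < numStreams; ++i)
    {
        cudaStreamSynchronize(streams[i]);
    }

    for (int i = 0; i < numStreams; ++i)
    {
        contexts[i]->destroy();
        cudaStreamDestroy(streams[i]);
    }
}
```

Note that whether the kernels actually overlap depends on the model and GPU occupancy: if a single inference already saturates the GPU's SMs, the streams will still be serialized by the hardware.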

By the way, we also have the Triton Inference Server built on top of TensorRT; it has a built-in scheduler to handle multiple engines/models.

Thank you.

I did create multiple IExecutionContext instances and assign each one a single CUDA stream, but the result is what I described above. However, when I bind each IExecutionContext to a different GPU, the inference time is close to that of a single stream.
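
For reference, a minimal sketch of the multi-GPU variant I mean, assuming `plan`/`planSize` are the serialized engine bytes, `bindings[d]` are buffers already allocated on GPU `d`, and `logger` is an `ILogger` implementation (all placeholder names); error checking and cleanup omitted.

```cpp
#include <vector>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// One engine, one IExecutionContext and one CUDA stream per GPU.
void runPerDevice(const void* plan, size_t planSize,
                  std::vector<std::vector<void*>>& bindings,
                  nvinfer1::ILogger& logger, int numDevices)
{
    std::vector<nvinfer1::ICudaEngine*> engines(numDevices);
    std::vector<nvinfer1::IExecutionContext*> contexts(numDevices);
    std::vector<cudaStream_t> streams(numDevices);

    for (int d = 0; d < numDevices; ++d)
    {
        cudaSetDevice(d);  // engine and context are bound to the current device
        auto* runtime = nvinfer1::createInferRuntime(logger);
        engines[d] = runtime->deserializeCudaEngine(plan, planSize, nullptr);
        contexts[d] = engines[d]->createExecutionContext();
        cudaStreamCreate(&streams[d]);
    }

    // Each device has its own SMs, so these inferences overlap fully,
    // unlike multiple streams competing for one GPU.
    for (int d = 0; d < numDevices; ++d)
    {
        cudaSetDevice(d);
        contexts[d]->enqueueV2(bindings[d].data(), streams[d], nullptr);
    }

    for (int d = 0; d < numDevices; ++d)
    {
        cudaSetDevice(d);
        cudaStreamSynchronize(streams[d]);
    }
}
```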
Thanks.