TensorRT Version: 7.2.3
GPU Type: Tesla T4
Nvidia Driver Version: 440.44
CUDA Version: 10.2
CUDNN Version: 8.0
Operating System + Version: CentOS 7
Python Version (if applicable): 3.6
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.6.0
Baremetal or Container (if container which image + tag):
When I run inference on multiple CUDA streams that share a single execution context, the speed is much slower than with a single stream. I used nvprof to inspect the GPU trace: the kernels from the different streams execute alternately, not in parallel as I expected.
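For context, this is roughly the command I used to collect the trace (`infer.py` stands in for my actual inference script):

```shell
nvprof --print-gpu-trace python infer.py
```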
My question is whether multiple streams are in fact executed serially on the GPU, and how I can get the best throughput when I have several engines that could run inference at the same time (see the sketch below for the kind of concurrency I am after).
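For reference, here is a minimal sketch of the setup I am aiming for. It assumes one execution context per stream (my understanding is that a single `IExecutionContext` cannot be enqueued on two streams concurrently), a static-shape engine, and hypothetical names (`model.engine`, `NUM_STREAMS`):

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
NUM_STREAMS = 2
ENGINE_PATH = "model.engine"  # placeholder path

with open(ENGINE_PATH, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# One stream plus one execution context per stream; a single context
# serializes the work even when enqueued on different streams.
streams = [cuda.Stream() for _ in range(NUM_STREAMS)]
contexts = [engine.create_execution_context() for _ in range(NUM_STREAMS)]

# Separate device buffers per stream so the copies/kernels do not alias.
allocations = []  # keep DeviceAllocation objects alive (pycuda frees on GC)
buffers = []
for _ in range(NUM_STREAMS):
    bindings = []
    for binding in engine:  # iterates binding names
        size = trt.volume(engine.get_binding_shape(binding))
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        mem = cuda.mem_alloc(size * np.dtype(dtype).itemsize)
        allocations.append(mem)
        bindings.append(int(mem))
    buffers.append(bindings)

# Enqueue everything first, synchronize afterwards, so the GPU is free
# to overlap the work from the different streams.
for i in range(NUM_STREAMS):
    contexts[i].execute_async_v2(bindings=buffers[i],
                                 stream_handle=streams[i].handle)
for s in streams:
    s.synchronize()
```

If this is the wrong pattern, or if kernel-level concurrency is simply not achievable here because each engine already saturates the GPU, a pointer to the recommended approach would be appreciated.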