Inference Time When Using Multiple Streams and Contexts in TensorRT Is No Faster than a Single Stream

Description

Running ConvNeXt inference in TensorRT (FP32, batch size 100) with multiple CUDA streams and execution contexts raises GPU utilization from roughly 5% to 80%, but the overall running time does not improve.

Environment

**TensorRT Version**: 8.6.1
**GPU Type**: RTX 3070 Ti
**Nvidia Driver Version**: 3.28.0.417
**CUDA Version**: 11.1
**CUDNN Version**: 11.3
**Operating System + Version**: Windows 10
**C++ Version**: C++14
**Baremetal or Container (if container which image + tag)**:

Hi,
I am using TensorRT for inference. The network is ConvNeXt, FP32, batch size 100, input shape [100, 3, 128, 128]. With a single stream, GPU utilization is around 5%. To accelerate inference, I tried multiple streams and execution contexts. With 6 contexts, GPU utilization rises to around 80%, but the running time remains the same. All contexts share one engine.
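
For reference, here is a minimal sketch of the multi-stream, multi-context setup described above, using the TensorRT 8.5+ `setTensorAddress`/`enqueueV3` API. The tensor names (`"input"`, `"output"`), the pre-allocated device buffers, and the helper function are placeholders for illustration, not code from this report:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

constexpr int kNumContexts = 6;

// Assumes `engine` is an already-deserialized engine and dIn[i]/dOut[i]
// are per-context device buffers sized for the [100,3,128,128] input.
void inferMultiContext(nvinfer1::ICudaEngine* engine,
                       std::vector<void*> const& dIn,
                       std::vector<void*> const& dOut)
{
    std::vector<nvinfer1::IExecutionContext*> contexts(kNumContexts);
    std::vector<cudaStream_t> streams(kNumContexts);

    for (int i = 0; i < kNumContexts; ++i)
    {
        // All contexts are created from the same engine and share its weights.
        contexts[i] = engine->createExecutionContext();
        cudaStreamCreate(&streams[i]);
        contexts[i]->setTensorAddress("input", dIn[i]);   // placeholder tensor names
        contexts[i]->setTensorAddress("output", dOut[i]);
    }

    // Enqueue all contexts on their own streams before synchronizing,
    // so the launches can overlap on the GPU.
    for (int i = 0; i < kNumContexts; ++i)
    {
        contexts[i]->enqueueV3(streams[i]);
    }
    // Synchronize once at the end, not after each enqueue.
    for (int i = 0; i < kNumContexts; ++i)
    {
        cudaStreamSynchronize(streams[i]);
    }

    for (int i = 0; i < kNumContexts; ++i)
    {
        cudaStreamDestroy(streams[i]);
        delete contexts[i];
    }
}
```

Note that synchronizing (or copying results) after each individual enqueue would serialize the streams and produce exactly the symptom above: higher utilization but no reduction in end-to-end time.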

The problem

  1. Why is GPU utilization so low with batch size 100?
  2. Why don't multiple streams and contexts reduce the overall inference time?

Thanks.
