Parallel execution of several TRT contexts on one GPU

Description

I’m running AI inference on video frames in C++, using an RTX 4090 GPU and TensorRT engines built from ONNX models. To accelerate this, I’m trying to run inference in parallel on one GPU using several CPU threads. In each CPU thread, a completely new TRT execution context is created, and each context uses its own separate CUDA stream.
With this setup, if I launch, say, 3 CPU threads, GPU utilization roughly triples, as it should. But here is the problem: no FPS increase is observed. Performance is exactly the same as if I were running just a single TRT context.
However, if I add another two 4090 devices and do the same thing - 3 TRT contexts in parallel on 3 different devices - everything is fine: FPS triples and everything works as intended.
So the problem is that even when I run several independent TRT contexts on one GPU device, they still seem to execute sequentially rather than in parallel.
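For clarity, here is a simplified sketch of the setup (not my exact code; the engine path, binding sizes, thread count, and iteration count are placeholders, and in this sketch the engine is deserialized once and shared across threads):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <thread>
#include <vector>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
} gLogger;

// One worker per CPU thread: its own execution context and its own CUDA stream.
void worker(nvinfer1::ICudaEngine* engine, int iterations) {
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    // Device buffers for the engine bindings (sizes here are placeholders).
    void* bindings[2]{};
    cudaMalloc(&bindings[0], 3 * 640 * 640 * sizeof(float)); // input
    cudaMalloc(&bindings[1], 1000 * sizeof(float));          // output

    for (int i = 0; i < iterations; ++i) {
        context->enqueueV2(bindings, stream, nullptr); // async inference on this thread's stream
        cudaStreamSynchronize(stream);                 // wait only for this stream
    }

    cudaFree(bindings[0]);
    cudaFree(bindings[1]);
    cudaStreamDestroy(stream);
    delete context;
}

int main() {
    // Deserialize a prebuilt engine ("model.engine" is a placeholder path).
    std::ifstream file("model.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size());

    // Three CPU threads -> three execution contexts -> three streams, all on one GPU.
    std::vector<std::thread> threads;
    for (int t = 0; t < 3; ++t) threads.emplace_back(worker, engine, 100);
    for (auto& t : threads) t.join();

    delete engine;
    delete runtime;
    return 0;
}
```

Each thread synchronizes only on its own stream, yet on a single 4090 the overall throughput matches the single-context case.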

Environment

TensorRT Version: 8.5.2.2
GPU Type: RTX 4090 (but the behavior does not appear specific to this GPU)
Nvidia Driver Version: 536.67
CUDA Version: 11.8
CUDNN Version: 8.6.0.163
Operating System + Version: Windows 10-11
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,

The link below might be useful for you:

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html

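In particular, that page covers cudaStreamCreateWithFlags; a stream created with the cudaStreamNonBlocking flag does not synchronize implicitly with the legacy default stream. A minimal sketch:

```cpp
#include <cuda_runtime_api.h>

int main() {
    // A stream created with cudaStreamNonBlocking does not implicitly
    // synchronize with the legacy default stream (stream 0).
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    // ... enqueue async work on `stream` here (kernels, copies, TRT enqueueV2) ...

    cudaStreamSynchronize(stream);  // wait only for work on this stream
    cudaStreamDestroy(stream);
    return 0;
}
```
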
For multi-threading/streaming, we suggest using DeepStream or Triton Inference Server.

For more details, we recommend raising the query in the DeepStream forum,

or

in the Triton Inference Server GitHub repository's issues section.

Thanks!