Speeding up a multi-threaded C++ program running TensorRT models

Description

I have 4 TensorRT models running on 4 threads in C++. When I run these models in a single-threaded program, the FPS is really good. However, when I run the models in the multi-threaded program, the FPS of each model drops by about 3x. I checked the GPU usage and it is low, around 45%. In each thread that runs a TensorRT model, the steps are:

cudaMemcpyAsync(H->D)
enqueueV2()
cudaMemcpyAsync(D->H)

Each thread has its own execution context and stream to run its TensorRT model.
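
For reference, one iteration of that loop in each thread looks roughly like the sketch below (buffer names, sizes, and binding order are placeholders, not my exact code):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <cstddef>

// Sketch of one worker thread's per-frame loop. Assumes binding 0 is the
// input and binding 1 is the output; dIn/dOut are device buffers already
// allocated with cudaMalloc to match the engine's binding sizes.
void runThread(nvinfer1::IExecutionContext* context,
               void* dIn, void* dOut, size_t inBytes, size_t outBytes,
               int numFrames)
{
    // Pinned (page-locked) host buffers: with plain pageable memory,
    // cudaMemcpyAsync degrades to a staged copy and streams stop overlapping.
    void* hIn  = nullptr;
    void* hOut = nullptr;
    cudaMallocHost(&hIn, inBytes);
    cudaMallocHost(&hOut, outBytes);

    // One non-blocking stream per thread, so this thread's work does not
    // synchronize with the legacy default stream.
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    void* bindings[] = { dIn, dOut };
    for (int i = 0; i < numFrames; ++i)
    {
        // ... fill hIn with the next batch ...
        cudaMemcpyAsync(dIn, hIn, inBytes, cudaMemcpyHostToDevice, stream);
        context->enqueueV2(bindings, stream, nullptr);
        cudaMemcpyAsync(hOut, dOut, outBytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);  // waits only on this thread's stream
        // ... consume hOut ...
    }

    cudaStreamDestroy(stream);
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
}
```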

  1. What should I do to increase the GPU usage of the multi-threaded program (note that I already run inference with batching)?

  2. As I read in the MPS documentation, work launched from work queues belonging to the same CUDA context can execute concurrently on the GPU, but cannot execute concurrently if the queues belong to different CUDA contexts. Can I create only one CUDA context and share it across all the threads, so the models have a better chance of running concurrently?
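
Concretely, question 2 asks whether a setup like the one below is what I should aim for: one engine shared across the threads, with each thread creating only its own IExecutionContext, stream, and buffers (the plan-file path, logger, and worker body are placeholders):

```cpp
#include <NvInfer.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <thread>
#include <vector>

// Minimal logger required by the TensorRT runtime (placeholder implementation).
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
} gLogger;

int main()
{
    // Deserialize the engine once and share it between the worker threads.
    std::ifstream file("model.plan", std::ios::binary);  // placeholder path
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());

    nvinfer1::IRuntime*    runtime = nvinfer1::createInferRuntime(gLogger);
    nvinfer1::ICudaEngine* engine  = runtime->deserializeCudaEngine(blob.data(), blob.size());

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([engine] {
            // The CUDA runtime API is used throughout, so every thread in the
            // process implicitly works in the device's primary CUDA context.
            // Only the execution context (and its stream/buffers) is per-thread.
            nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();
            // ... allocate buffers, create a stream, run the per-frame loop above ...
            delete ctx;
        });
    }
    for (auto& t : workers) t.join();

    delete engine;
    delete runtime;
    return 0;
}
```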

Environment

TensorRT Version: 8.4.2.4
GPU Type: 3060
Nvidia Driver Version: 510
CUDA Version: 470.161.03
CUDNN Version: 8.5.0.96
Operating System + Version: Ubuntu 20.04

Hi,

The links below might be useful for you.
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#thread-safety

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html

For multi-threading/streaming, we suggest you use DeepStream or Triton.

For more details, we recommend you raise the query on the DeepStream forum,

or

raise the query in the Triton Inference Server GitHub issues section.

Thanks!