Speeding up a multi-threaded C++ program running TensorRT models

Description

I have 4 TensorRT models running on 4 threads in C++. When I run these models in a single-threaded program, the FPS is really good. However, when I run them in the multi-threaded program, the FPS of each model drops by a factor of about 3. I checked the GPU utilization and it is low, around 45%. In each thread that runs a TensorRT model, the steps are:

cudaMemcpyAsync(H->D)
enqueueV2()
cudaMemcpyAsync(D->H)

Each thread has its own context and its own stream for running its TensorRT model.
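To compare single-threaded and multi-threaded numbers like the ones above, it can help to time the per-thread loop in a uniform way. The following is only a minimal sketch: `measureFps` and its `pipeline` callable are hypothetical names, not part of TensorRT or CUDA. In a real program the callable would issue the three calls listed above on that thread's stream and then wait with `cudaStreamSynchronize`, because the asynchronous calls return before the GPU work finishes and timing them alone would measure only the enqueue cost.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: runs `pipeline(threadIndex)` `framesPerThread` times on
// each of `numThreads` threads and returns the throughput of every thread in
// frames per second. `pipeline` stands in for copy-in, inference, copy-out
// plus a stream synchronization.
std::vector<double> measureFps(int numThreads, int framesPerThread,
                               const std::function<void(int)>& pipeline) {
    assert(numThreads > 0 && framesPerThread > 0);
    std::vector<double> fps(numThreads, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(numThreads);

    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            const auto start = std::chrono::steady_clock::now();
            for (int i = 0; i < framesPerThread; ++i) {
                pipeline(t);  // one frame through the whole pipeline
            }
            const std::chrono::duration<double> elapsed =
                std::chrono::steady_clock::now() - start;
            fps[t] = framesPerThread / elapsed.count();  // each thread writes its own slot
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return fps;
}
```

Calling it once with `numThreads = 1` and once with `numThreads = 4` using the same callable gives directly comparable per-thread figures.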

  1. What should I do to increase GPU utilization in the multi-threaded program? (Note that I already run inference on the models in batches.)

  2. As I read in the MPS documentation, work launched from work queues belonging to the same CUDA context can execute concurrently on the GPU, while work belonging to different CUDA contexts cannot. My question is: can I create only one CUDA context and share it across all the threads, so that the models have more chances to run concurrently?

Environment

TensorRT Version: 8.4.2.4
GPU Type: 3060
Nvidia Driver Version: 510
CUDA Version: 470.161.03
CUDNN Version: 8.5.0.96
Operating System + Version: Ubuntu 20.04

Hi,

The links below might be useful for you.
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#thread-safety

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html

For multi-threading/streaming, we suggest using DeepStream or Triton.

For more details, we recommend raising the query in the DeepStream forum, or in the issues section of the Triton Inference Server GitHub repository.

Thanks!

I have been searching for an answer to this question for days and still have no working example. I want to deploy a TensorRT engine in C++ for inference on multiple RTSP streams, so what I do now is multi-threaded inference. Here is the pain: when running in one thread, the time per frame is about 10-20 ms, but with 2/3/4/5 threads the time per frame increases roughly in proportion to the number of threads.
P.S. I have tried using the same runtime/engine across threads as well as different ones (I tried many combinations), and I confirm that for the memcpy I use page-locked memory allocated with cudaHostAlloc. None of these attempts helped!

My question is: is it even possible to do multi-threaded inference in C++ with TensorRT? If the answer is yes, then how (or are there any working examples)?

Besides, I have tested on my laptop's 3060 GPU and on a desktop 4090 GPU, with the same qualitative result.

I also confirm that my GPU utilization is about 70% when 5 threads are launched, similar to what the thread starter reports, in case anyone suspects that the GPU is simply too busy to handle the inference tasks.
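Nothing in this thread confirms it as a fix, but one pattern sometimes used for multi-stream setups is to keep decoding in several capture threads and hand frames to a single inference thread that groups them into batches. The following is only a minimal sketch of the hand-off: `Frame` and `FrameBatcher` are hypothetical names, and the actual TensorRT call is left out on purpose.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical frame record; a real pipeline would also carry the image data.
struct Frame {
    int streamId;  // which RTSP stream produced the frame
    int sequence;  // per-stream frame counter
};

// Thread-safe queue: many capture threads push, one inference thread pops
// batches of at most `maxBatch` frames.
class FrameBatcher {
public:
    explicit FrameBatcher(std::size_t maxBatch) : maxBatch_(maxBatch) {
        assert(maxBatch > 0);
    }

    // Called by capture threads.
    void push(Frame f) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            queue_.push(f);
        }
        cv_.notify_one();
    }

    // Called by the inference thread. Blocks until a frame is available or
    // close() was called. An empty result means "closed and drained".
    std::vector<Frame> popBatch() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return closed_ || !queue_.empty(); });
        std::vector<Frame> batch;
        while (!queue_.empty() && batch.size() < maxBatch_) {
            batch.push_back(queue_.front());
            queue_.pop();
        }
        return batch;
    }

    // Wakes the inference thread so it can exit once the queue is empty.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            closed_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<Frame> queue_;
    std::size_t maxBatch_;
    bool closed_ = false;
};
```

The inference thread would loop on `popBatch()`, copy the returned frames into one batched input buffer, and run a single inference per batch. Whether this beats one execution context per thread depends on the model and the batch sizes the engine was built for, so it would need to be measured.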

I have encountered this issue too. May I ask if you have solved it now?

I tested it on my 4090 desktop and got the same result. I think my inference is slow because of insufficient GPU memory. Inference is relatively fast at the beginning. Once the dedicated GPU memory is used up, system memory is used instead, and I think this is why inference becomes slow later. I created a context in each thread. From the output, I found that each context allocated 4 or 8 GB of memory (depending on the engine). Surely a single context should not need that much memory?
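The arithmetic behind this suspicion is quick to check. Below is a minimal sketch, assuming a 24 GB card (the dedicated memory of an RTX 4090) and a hypothetical mix of contexts, since the post does not say how many threads were running. The function names are made up for illustration, and the sum ignores input/output buffers and any other allocations.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sums the memory reported for each context, in GB.
double totalContextMemoryGb(const std::vector<double>& perContextGb) {
    return std::accumulate(perContextGb.begin(), perContextGb.end(), 0.0);
}

// True if all contexts together fit in the card's dedicated memory.
// `deviceGb` is an assumption supplied by the caller (e.g. 24 for a 4090).
bool fitsInDedicatedMemory(const std::vector<double>& perContextGb,
                           double deviceGb) {
    assert(deviceGb > 0.0);
    return totalContextMemoryGb(perContextGb) <= deviceGb;
}
```

With these hypothetical numbers, two 8 GB contexts plus two 4 GB contexts already fill 24 GB exactly, and replacing one 4 GB context with an 8 GB one (28 GB in total) no longer fits.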