Speeding up a multi-threaded C++ program running TensorRT models

Description

I have 4 TensorRT models running on 4 threads in a C++ program. When I run each model in a single-threaded program, the FPS is really good. However, when I run the models in the multi-threaded program, the FPS of each model drops roughly 3x. I checked the GPU utilization and it is low, around 45%. In each thread that runs a TensorRT model, the steps are:

cudaMemcpyAsync(H->D)
enqueueV2()
cudaMemcpyAsync(D->H)

Each thread has its own execution context and stream to run its TensorRT model, roughly as sketched below.
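For reference, here is a minimal sketch of what each worker thread does per frame, assuming the engine, device/host buffers, and binding array are set up elsewhere; names such as dIn, dOut, hIn, hOut, and bindings are placeholders (not from the original post), and error checking is omitted:

// Per-thread inference step: each thread owns its own execution context and stream.
#include <NvInfer.h>
#include <cuda_runtime_api.h>

void inferOnce(nvinfer1::IExecutionContext* ctx, cudaStream_t stream,
               void* dIn, void* dOut,          // device buffers (placeholders)
               const void* hIn, void* hOut,    // pinned host buffers (placeholders)
               size_t inBytes, size_t outBytes,
               void** bindings)                // {dIn, dOut} in binding order
{
    // 1. Copy input host -> device on this thread's stream
    cudaMemcpyAsync(dIn, hIn, inBytes, cudaMemcpyHostToDevice, stream);

    // 2. Launch inference on the same stream
    ctx->enqueueV2(bindings, stream, nullptr);

    // 3. Copy output device -> host on the same stream
    cudaMemcpyAsync(hOut, dOut, outBytes, cudaMemcpyDeviceToHost, stream);

    // 4. Wait for this stream only; other threads' streams are unaffected
    cudaStreamSynchronize(stream);
}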

  1. What should I do to increase the GPU utilization in the multi-threaded program (note that I already run batched inference for the models)?

  2. As I read in the MPS documentation, work launched from work queues belonging to the same CUDA context can execute concurrently on the GPU, but cannot execute concurrently if it belongs to different CUDA contexts. My question is: can I create only one CUDA context and share it across all the threads to have a better chance of running the models concurrently?
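As an illustration only, not a confirmed answer to the CUDA-context/MPS question: the TensorRT thread-safety notes linked in the reply below allow a single engine to be shared by multiple threads as long as each thread uses its own IExecutionContext. A rough sketch of that setup, where model.engine, NUM_WORKERS, and the Logger stub are placeholders:

// One runtime + one engine shared by all worker threads; each thread creates
// its own IExecutionContext and CUDA stream. Sketch only, cleanup abbreviated.
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iterator>
#include <thread>
#include <vector>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {}
} gLogger;

int main() {
    // Deserialize the engine once (path is a placeholder)
    std::ifstream f("model.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(f)),
                            std::istreambuf_iterator<char>());
    auto* runtime = nvinfer1::createInferRuntime(gLogger);
    auto* engine  = runtime->deserializeCudaEngine(blob.data(), blob.size());

    const int NUM_WORKERS = 4;   // placeholder thread count
    std::vector<std::thread> workers;
    for (int i = 0; i < NUM_WORKERS; ++i) {
        workers.emplace_back([engine] {
            // Per-thread resources: one execution context and one stream each
            nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();
            cudaStream_t stream;
            cudaStreamCreate(&stream);

            // ... allocate buffers, then loop: H->D copy, enqueueV2, D->H copy ...

            cudaStreamDestroy(stream);
            delete ctx;   // TensorRT 8: delete replaces the deprecated destroy()
        });
    }
    for (auto& t : workers) t.join();
    delete engine;
    delete runtime;
    return 0;
}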

Environment

TensorRT Version: 8.4.2.4
GPU Type: 3060
Nvidia Driver Version: 510
CUDA Version: 470.161.03
CUDNN Version: 8.5.0.96
Operating System + Version: Ubuntu 20.04

Hi,

The below links might be useful for you.
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#thread-safety

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html

For multi-threading/streaming, we suggest using DeepStream or Triton.

For more details, we recommend raising the query on the DeepStream forum,

or

in the issues section of the Triton Inference Server GitHub repository.

Thanks!

I have been searching this question for days and still have no working example. The problem is that I want to deploy the TensorRT engine in C++ for multi-RTSP inference, so what I do now is multi-threaded inference. But here is the pain: when running in one thread, the time consumed per frame is about 10-20 ms, but with 2/3/4/5 threads the time increases roughly in proportion to the number of threads.
P.S. I have tried using the same (or different, I tried many options) runtime/engine for the multiple threads, and I confirm that for memcpy I use page-locked memory allocated with cudaHostAlloc. None of these trials helped!
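For context, this is roughly the page-locked allocation pattern referred to above, sketched with a placeholder buffer size; the pointer returned by cudaHostAlloc is what gets passed to cudaMemcpyAsync:

// Allocate page-locked (pinned) host memory so cudaMemcpyAsync can be truly
// asynchronous instead of falling back to a staged, blocking copy.
#include <cuda_runtime_api.h>

int main() {
    const size_t inBytes = 1 << 20;   // placeholder buffer size
    float* hIn = nullptr;
    cudaHostAlloc(reinterpret_cast<void**>(&hIn), inBytes, cudaHostAllocDefault);

    // ... fill hIn, then on a worker thread's stream:
    // cudaMemcpyAsync(dIn, hIn, inBytes, cudaMemcpyHostToDevice, stream);

    cudaFreeHost(hIn);
    return 0;
}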

My question is: is it even possible to do multi-threaded inference in C++ for a TensorRT project? If the answer is yes, then how (or is there any working example)?

Besides, I have tested on my laptop's 3060 GPU and a desktop 4090 GPU, with the same qualitative result.

I also confirm that my GPU utilization is about 70% when 5 threads are launched, similar to what the thread starter reports, which rules out the GPU simply being too busy to handle the inference tasks.

I have encountered this issue too. May I ask if you have solved it now?