Parallel execution of several TRT contexts on one GPU

Description

I’m running AI inference on video frames in C++, using an RTX 4090 GPU and TensorRT engines built from ONNX models. To accelerate this, I’m trying to run inference in parallel on one GPU using several CPU threads. In each CPU thread, a completely new TRT execution context is created, and each context uses its own separate CUDA stream.
With this setup, if I launch, say, 3 CPU threads, GPU utilization roughly triples, as it should. But here is the problem: no FPS increase is observed. Performance is exactly the same as if I were running just a single TRT context.
However, if I add another two 4090 devices and do the same thing - 3 TRT contexts in parallel on 3 different devices - everything is fine: FPS triples and everything works as intended.
So the problem is that even when I run several independent TRT contexts on one GPU device, they still seem to execute sequentially rather than in parallel.
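For clarity, here is a simplified sketch of the setup (not my exact code; the engine path, binding sizes, thread count, and iteration count are placeholders, and in this sketch the engine is deserialized once and shared across threads):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <thread>
#include <vector>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
} gLogger;

// One worker per CPU thread: its own execution context and its own CUDA stream.
void worker(nvinfer1::ICudaEngine* engine, int iterations) {
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    // Device buffers for the engine bindings (sizes here are placeholders).
    void* bindings[2]{};
    cudaMalloc(&bindings[0], 3 * 640 * 640 * sizeof(float)); // input
    cudaMalloc(&bindings[1], 1000 * sizeof(float));          // output

    for (int i = 0; i < iterations; ++i) {
        context->enqueueV2(bindings, stream, nullptr); // async inference on this thread's stream
        cudaStreamSynchronize(stream);                 // wait only for this stream
    }

    cudaFree(bindings[0]);
    cudaFree(bindings[1]);
    cudaStreamDestroy(stream);
    delete context;
}

int main() {
    // Deserialize a prebuilt engine ("model.engine" is a placeholder path).
    std::ifstream file("model.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size());

    // Three CPU threads -> three execution contexts -> three streams, all on one GPU.
    std::vector<std::thread> threads;
    for (int t = 0; t < 3; ++t) threads.emplace_back(worker, engine, 100);
    for (auto& t : threads) t.join();

    delete engine;
    delete runtime;
    return 0;
}
```

Each thread synchronizes only on its own stream, yet on a single 4090 the overall throughput matches the single-context case.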

Environment

TensorRT Version: 8.5.2.2
GPU Type: RTX 4090 (but the behavior does not appear specific to this GPU)
Nvidia Driver Version: 536.67
CUDA Version: 11.8
CUDNN Version: 8.6.0.163
Operating System + Version: Windows 10-11
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,

The link below might be useful for you:

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html

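In particular, that page covers cudaStreamCreateWithFlags; a stream created with the cudaStreamNonBlocking flag does not synchronize implicitly with the legacy default stream. A minimal sketch:

```cpp
#include <cuda_runtime_api.h>

int main() {
    // A stream created with cudaStreamNonBlocking does not implicitly
    // synchronize with the legacy default stream (stream 0).
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    // ... enqueue async work on `stream` here (kernels, copies, TRT enqueueV2) ...

    cudaStreamSynchronize(stream);  // wait only for work on this stream
    cudaStreamDestroy(stream);
    return 0;
}
```
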
For multi-threading/streaming, we suggest using DeepStream or Triton Inference Server.

For more details, we recommend raising the query in the DeepStream forum,

or

in the Triton Inference Server GitHub repository's issues section.

Thanks!