TensorRT Parallel Inference / Concurrent Inferencing

Hello,
I want to achieve parallel inference on two TensorRT engines. Could someone please point me to documentation / sample code?

E.g. Engine 1 takes 30 ms and Engine 2 takes 30 ms. I want to create a multi-threaded pipeline where both threads run simultaneously, so the combined execution finishes in 30 ms.

Right now, I have created 2 threads with different execution contexts. The GPU execution is not happening in parallel.
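For context, this is roughly what that setup looks like. It is a minimal sketch, not our actual code: engine deserialization, buffer allocation, and error handling are omitted, and the names are illustrative. Each thread owns its own IExecutionContext, but inference is launched with the synchronous executeV2() call, which blocks the calling thread, and the work from the two contexts does not overlap on the GPU.

```cpp
// Minimal sketch of the two-thread setup described above (engine loading,
// buffer allocation, and error handling omitted; names are illustrative).
// Each thread owns its own execution context, but the synchronous call
// blocks the thread and the work does not overlap on the GPU.
#include <thread>
#include <NvInfer.h>

void runSync(nvinfer1::ICudaEngine* engine, void** bindings)
{
    // One IExecutionContext per thread; contexts are not thread-safe,
    // so they must not be shared across threads.
    nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();
    ctx->executeV2(bindings);   // synchronous, blocks until inference completes
    ctx->destroy();
}

void runBothEngines(nvinfer1::ICudaEngine* engine1, void** bindings1,
                    nvinfer1::ICudaEngine* engine2, void** bindings2)
{
    std::thread t1(runSync, engine1, bindings1);
    std::thread t2(runSync, engine2, bindings2);
    t1.join();
    t2.join();
}
```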

TensorRT Version: 7.0.0-1+cuda10.0
GPU Type: RTX 2060
Nvidia Driver Version:
CUDA Driver Version / Runtime Version: 10.2 / 10.0
CUDA Capability Major/Minor version number: 7.5
CUDNN Version: -
Operating System + Version: Ubuntu 18.04.4 LTS
Python Version (if applicable): Python 3.6

Screenshot of visual profiler:

Figured out the issue with concurrent execution. We were using TensorRT in sync mode with the default stream, which was blocking the other context.

After running it asynchronously, with a different stream for each context, we did not get any performance gain: each context on its own takes 30 ms, and running them sequentially results in 60 ms.

Running them concurrently, each context takes 60 ms.
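For anyone trying to reproduce this, the change is essentially the following sketch (again illustrative only; deserialization, buffer allocation, and error handling are omitted): each thread gets its own execution context and its own non-default CUDA stream, launches with enqueueV2(), and synchronizes only on its own stream.

```cpp
// Minimal sketch of the async version (engine deserialization, buffer
// allocation, and error handling omitted; names are illustrative).
// Each thread owns one execution context and one non-default stream,
// launches with enqueueV2(), and waits only on its own stream.
#include <thread>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

void runAsync(nvinfer1::ICudaEngine* engine, void** bindings)
{
    nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();

    cudaStream_t stream;
    cudaStreamCreate(&stream);                  // per-context non-default stream

    ctx->enqueueV2(bindings, stream, nullptr);  // asynchronous launch
    cudaStreamSynchronize(stream);              // wait only for this context's work

    cudaStreamDestroy(stream);
    ctx->destroy();
}

void runBothAsync(nvinfer1::ICudaEngine* engine1, void** bindings1,
                  nvinfer1::ICudaEngine* engine2, void** bindings2)
{
    std::thread t1(runAsync, engine1, bindings1);
    std::thread t2(runAsync, engine2, bindings2);
    t1.join();
    t2.join();
}
```

With this, the profiler shows the two contexts overlapping, but as noted above, each enqueue now takes about 60 ms, so the end-to-end time does not improve.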

Hi @avra,
Kindly refer to the below link for your reference.
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#streaming
https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf

Thanks!

Hello @AakankshaS
Thanks for sharing the resources. Following them, we were able to get the 2 contexts running concurrently. However, we did not get any performance benefit.
When we run only 1 context, it takes ~30 ms and compute utilization is 45%.

When we run them concurrently, I was expecting the 2 contexts to take ~30 ms with 90% overall utilization.
In reality, it’s taking 60 ms.


What do you attribute this to? Is it due to GPU resources being shared between the two contexts? In concurrent mode, each context's utilization drops to 40%.
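For reference, this is one way the per-context latency could be measured in the async setup, so that overlapping contexts each report their own time. It is a sketch only, with illustrative names, and assumes the enqueueV2()-per-stream pattern shown earlier.

```cpp
// Hedged sketch (not from the original post): time one context's work with
// CUDA events recorded on its own stream, so overlapping contexts can each
// report their individual latency.
#include <cstdio>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

float timedEnqueue(nvinfer1::IExecutionContext* ctx, void** bindings, cudaStream_t stream)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    ctx->enqueueV2(bindings, stream, nullptr);  // asynchronous launch on this stream
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);                 // wait for this stream's work only

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("context latency: %.2f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```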

@AakankshaS can you please help me understand this?

Hi @avra,
The Engineering team is looking into this. Please allow us some time.
Thanks!

Hi Akansha,
Could you please give an update on this?

Hi @avra ,
For concurrent execution, you may look at the below link

Thanks!

Hello, I want to do the same thing as you and ran into the same problems. Did you implement it in the end?

Hi,
We have not been able to implement this with the required performance benefits.

From what I have understood, this is a CUDA driver limitation. Once you create 2 contexts, how they are serviced by the GPU depends on the scheduler. There are also a number of GPU architecture limitations, so I am a little out of my depth here.