TensorRT Parallel Inference / Concurrent Inferencing

Hello,
I want to achieve parallel inference on two TensorRT engines. Could someone please point me to documentation / sample code?

E.g. Engine 1 takes 30 ms and Engine 2 takes 30 ms. I want to create a multi-threaded pipeline where both threads run simultaneously, so the whole pipeline still executes in ~30 ms.

Right now, I have created two threads, each with its own execution context, but the GPU execution is not happening in parallel.
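
Roughly, the setup looks like this (a minimal sketch, not my actual code: engine paths and tensor shapes are placeholders, and it assumes the TensorRT Python API with pycuda):

```python
import threading

import numpy as np
import pycuda.driver as cuda
import tensorrt as trt

cuda.init()
cuda_ctx = cuda.Device(0).make_context()   # one CUDA context, shared by both threads
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(path):
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

def infer(engine, host_input, host_output):
    cuda_ctx.push()                        # make the shared CUDA context current in this thread
    try:
        context = engine.create_execution_context()
        d_input = cuda.mem_alloc(host_input.nbytes)
        d_output = cuda.mem_alloc(host_output.nbytes)
        cuda.memcpy_htod(d_input, host_input)
        context.execute_v2(bindings=[int(d_input), int(d_output)])  # synchronous execute, no explicit stream
        cuda.memcpy_dtoh(host_output, d_output)
    finally:
        cuda_ctx.pop()

# placeholder shapes and engine paths -- adjust for the real models
in1, out1 = np.zeros((1, 3, 224, 224), np.float32), np.empty((1, 1000), np.float32)
in2, out2 = np.zeros((1, 3, 224, 224), np.float32), np.empty((1, 1000), np.float32)
t1 = threading.Thread(target=infer, args=(load_engine("engine1.trt"), in1, out1))
t2 = threading.Thread(target=infer, args=(load_engine("engine2.trt"), in2, out2))
t1.start(); t2.start()
t1.join(); t2.join()
cuda_ctx.pop()                             # release the context from the main thread
```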

TensorRT Version: 7.0.0-1+cuda10.0
GPU Type: RTX 2060
Nvidia Driver Version:
CUDA Driver Version / Runtime Version: 10.2 / 10.0
CUDA Capability Major/Minor version number: 7.5
CUDNN Version: -
Operating System + Version: Ubuntu 18.04.4 LTS
Python Version (if applicable): Python 3.6

Screenshot of visual profiler:

Figured out the issue with concurrent execution. We were using TensorRT in synchronous mode on the default stream, which was blocking the other context.
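
For anyone else hitting this, the change amounts to giving each execution context its own non-default CUDA stream and switching to the async execute call. A minimal sketch along the lines of the snippet in my first post (same placeholder shapes; execute_async_v2 assumes an explicit-batch engine, and the host buffers would need to be page-locked for the copies to be truly asynchronous):

```python
def infer_async(engine, host_input, host_output):
    cuda_ctx.push()                                  # shared CUDA context, pushed per thread
    try:
        context = engine.create_execution_context()
        stream = cuda.Stream()                       # non-default stream, one per context
        d_input = cuda.mem_alloc(host_input.nbytes)
        d_output = cuda.mem_alloc(host_output.nbytes)
        cuda.memcpy_htod_async(d_input, host_input, stream)
        context.execute_async_v2(bindings=[int(d_input), int(d_output)],
                                 stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(host_output, d_output, stream)
        stream.synchronize()                         # wait only on this thread's stream
    finally:
        cuda_ctx.pop()
```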

After running it asynchronously with a different context, we did not get any performance gain: each context takes 30 ms, and running them sequentially results in 60 ms.

When running them concurrently, each context takes 60 ms.

Hi @avra,
Kindly refer to the link below.


Thanks!

Hello @AakankshaS
Thanks for sharing the resources. Following them, we were able to get the two contexts running concurrently. However, we did not get any performance benefit.
When we run only one context, it takes ~30 ms and compute utilization is 45%.

When we run them concurrently, I was expecting the two contexts to take ~30 ms with ~90% overall utilization. In reality, it's taking 60 ms.


What do you attribute this to? Is it due to GPU resources being shared between the two contexts? In concurrent mode, each context's utilization drops to 40%.

@AakankshaS, can you please help me understand this?

Hi @avra,
The Engineering team is looking into this. Please allow us some time.
Thanks!