TensorRT Parallel Inference / Concurrent Inferencing

Hello,
I want to achieve parallel inference on two TensorRT engines. Could someone please point me to documentation / sample code?

E.g. Engine 1 takes 30 ms and Engine 2 takes 30 ms. I want to create a multi-threaded pipeline where both threads run simultaneously, so the combined execution finishes in 30 ms.

Right now, I have created 2 threads with different execution contexts. The GPU execution is not happening in parallel.
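For context, this is roughly what that setup looks like. It is a minimal sketch, not our actual code: engine deserialization, buffer allocation, and error handling are omitted, and the names are illustrative. Each thread owns its own IExecutionContext, but inference is launched with the synchronous executeV2() call, which blocks the calling thread, and the work from the two contexts does not overlap on the GPU.

```cpp
// Minimal sketch of the two-thread setup described above (engine loading,
// buffer allocation, and error handling omitted; names are illustrative).
// Each thread owns its own execution context, but the synchronous call
// blocks the thread and the work does not overlap on the GPU.
#include <thread>
#include <NvInfer.h>

void runSync(nvinfer1::ICudaEngine* engine, void** bindings)
{
    // One IExecutionContext per thread; contexts are not thread-safe,
    // so they must not be shared across threads.
    nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();
    ctx->executeV2(bindings);   // synchronous, blocks until inference completes
    ctx->destroy();
}

void runBothEngines(nvinfer1::ICudaEngine* engine1, void** bindings1,
                    nvinfer1::ICudaEngine* engine2, void** bindings2)
{
    std::thread t1(runSync, engine1, bindings1);
    std::thread t2(runSync, engine2, bindings2);
    t1.join();
    t2.join();
}
```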

TensorRT Version: 7.0.0-1+cuda10.0
GPU Type: RTX 2060
Nvidia Driver Version:
CUDA Driver Version / Runtime Version: 10.2 / 10.0
CUDA Capability Major/Minor version number: 7.5
CUDNN Version: -
Operating System + Version: Ubuntu 18.04.4 LTS
Python Version (if applicable): Python 3.6

Screenshot of visual profiler:

Figured out the issue with concurrent execution. We were using TensorRT in sync mode with the default stream, which was blocking the other context.

After running it asynchronously, with a different stream for each context, we did not get any performance gain: each context on its own takes 30 ms, and running them sequentially results in 60 ms.

Running them concurrently, each context takes 60 ms.
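For anyone trying to reproduce this, the change is essentially the following sketch (again illustrative only; deserialization, buffer allocation, and error handling are omitted): each thread gets its own execution context and its own non-default CUDA stream, launches with enqueueV2(), and synchronizes only on its own stream.

```cpp
// Minimal sketch of the async version (engine deserialization, buffer
// allocation, and error handling omitted; names are illustrative).
// Each thread owns one execution context and one non-default stream,
// launches with enqueueV2(), and waits only on its own stream.
#include <thread>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

void runAsync(nvinfer1::ICudaEngine* engine, void** bindings)
{
    nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();

    cudaStream_t stream;
    cudaStreamCreate(&stream);                  // per-context non-default stream

    ctx->enqueueV2(bindings, stream, nullptr);  // asynchronous launch
    cudaStreamSynchronize(stream);              // wait only for this context's work

    cudaStreamDestroy(stream);
    ctx->destroy();
}

void runBothAsync(nvinfer1::ICudaEngine* engine1, void** bindings1,
                  nvinfer1::ICudaEngine* engine2, void** bindings2)
{
    std::thread t1(runAsync, engine1, bindings1);
    std::thread t2(runAsync, engine2, bindings2);
    t1.join();
    t2.join();
}
```

With this, the profiler shows the two contexts overlapping, but as noted above, each enqueue now takes about 60 ms, so the end-to-end time does not improve.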

Hi @avra,
Kindly refer to the below link for your reference.
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#streaming
https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf

Thanks!

Hello @AakankshaS
Thanks for sharing the resources. Following them, we were able to get the 2 contexts running concurrently. However, we did not get any performance benefit.
When we run only 1 context, it takes ~30 ms and compute utilization is 45%.

When we run them concurrently, I was expecting the 2 contexts to take ~30 ms with 90% overall utilization.
In reality, it’s taking 60 ms.


What do you attribute this to? Is it due to GPU resources being shared between the two contexts? In concurrent mode, each context's utilization drops to 40%.
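For reference, this is one way the per-context latency could be measured in the async setup, so that overlapping contexts each report their own time. It is a sketch only, with illustrative names, and assumes the enqueueV2()-per-stream pattern shown earlier.

```cpp
// Hedged sketch (not from the original post): time one context's work with
// CUDA events recorded on its own stream, so overlapping contexts can each
// report their individual latency.
#include <cstdio>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

float timedEnqueue(nvinfer1::IExecutionContext* ctx, void** bindings, cudaStream_t stream)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    ctx->enqueueV2(bindings, stream, nullptr);  // asynchronous launch on this stream
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);                 // wait for this stream's work only

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("context latency: %.2f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```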

@AakankshaS can you please help me understand this?

Hi @avra,
The Engineering team is looking into this. Please allow us some time.
Thanks!

Hi Akansha,
Could you please give an update on this?

Hi @avra ,
For concurrent execution, you may look at the below link

Thanks!

Hello, I want to do the same thing as you and ran into the same problems. Did you implement it in the end?

Hi,
We have not been able to implement this with the required performance benefits.

From what I have understood, this is a CUDA driver limitation. Once you create 2 contexts, how they are serviced by the GPU depends on the scheduler. There are also a number of GPU architecture limitations, so I am a little out of my depth here.