Hello,
I want to achieve parallel inference on two TensorRT engines. Could someone please point me to documentation or sample code?
E.g. Engine 1 takes 30 ms and Engine 2 takes 30 ms. I want to create a multi-threaded pipeline where both threads run simultaneously and the pair executes in 30 ms.
Right now, I have created two threads with separate execution contexts, but the GPU execution is not happening in parallel.
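To make the goal concrete, the overlap I am after can be sketched with plain Python threads, where `time.sleep` (which releases the GIL, much like an asynchronous GPU launch frees the host thread) stands in for a hypothetical 30 ms inference call:

```python
import threading
import time

ENGINE_LATENCY_S = 0.03  # stand-in for one engine's ~30 ms inference

def run_engine(name, results):
    # In the real pipeline this would enqueue inference on the thread's
    # own execution context; sleep only models the 30 ms latency.
    start = time.perf_counter()
    time.sleep(ENGINE_LATENCY_S)
    results[name] = time.perf_counter() - start

results = {}
threads = [threading.Thread(target=run_engine, args=(n, results))
           for n in ("engine1", "engine2")]

wall_start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
wall = time.perf_counter() - wall_start

print(f"engine1: {results['engine1'] * 1e3:.0f} ms, "
      f"engine2: {results['engine2'] * 1e3:.0f} ms, "
      f"wall: {wall * 1e3:.0f} ms")
```

With true overlap the wall time stays near 30 ms instead of the 60 ms sequential sum; if the GPU work itself serializes, the wall time degrades back toward 60 ms even though two host threads are running.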
TensorRT Version: 7.0.0-1+cuda10.0
GPU Type: RTX 2060
Nvidia Driver Version:
CUDA Driver Version / Runtime Version: 10.2 / 10.0
CUDA Capability Major/Minor version number: 7.5
CUDNN Version: -
Operating System + Version: Ubuntu 18.04.4 LTS
Python Version (if applicable): Python 3.6
Figured out the issue with concurrent execution. We were using TensorRT in synchronous mode with the default stream, which was blocking the other context.
After running it async, each with its own context, we did not get any performance gain, i.e. each context took 30 ms, so running them sequentially resulted in 60 ms.
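For anyone following along, here is a minimal sketch of the change we made: each thread owns its own execution context and its own non-default CUDA stream, and enqueues with `execute_async_v2` (TensorRT 7 Python API, with pycuda assumed). `engine` and `bindings` are placeholders for an already-deserialized ICudaEngine and its device buffer pointers; buffer allocation and CUDA context handling are simplified.

```python
import threading

def make_infer_thread(engine, bindings, n_iters=100):
    """Build a worker thread with its own IExecutionContext and its own
    non-default CUDA stream, so enqueues from two such threads can overlap
    on the GPU instead of serializing on the legacy default stream."""
    def worker():
        # pycuda is assumed available on the target machine; real code must
        # also make a CUDA context current in this thread (e.g. via
        # cuda.Device(0).make_context()) and pop it when done.
        import pycuda.driver as cuda
        context = engine.create_execution_context()
        stream = cuda.Stream()  # non-default stream: key to concurrency
        for _ in range(n_iters):
            # execute_async_v2 enqueues work without blocking the host thread
            context.execute_async_v2(bindings=bindings,
                                     stream_handle=stream.handle)
        stream.synchronize()  # wait once at the end, not per iteration
    return threading.Thread(target=worker)
```

Usage would be along the lines of building one thread per engine, starting both, and joining both; whether the kernels actually overlap on the device is then up to the hardware scheduler.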
Hello @AakankshaS
Thanks for sharing the resources. Following them, we were able to get the two contexts running concurrently. However, we did not get any performance benefit.
When we run only one context, it takes ~30 ms and compute utilization is 45%.
Hi,
We have not implemented this with the required performance benefits.
From what I understand, this is a CUDA driver limitation: once you create two contexts, how they are serviced by the GPU is up to the scheduler. There are also a number of GPU architecture limitations at play, so I am a little out of my depth here.