How to run inference in multiple threads (allocating host and device buffers only once for all execution contexts)

Hi,

I have several TensorRT engines, including YOLO for detection and Inception for classification. Currently I run these engines one after another, which takes a lot of time. What I want is to run them in parallel using multi-threading in Python. I know the TensorRT runtime can be used by multiple threads simultaneously as long as each thread uses its own execution context. However, each execution context allocates its own host and device buffers, so the total allocated memory becomes quite large. Is there a way to run inference in multiple threads while allocating the buffers only once and sharing them across all execution contexts? Thanks!
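One common pattern for this is to pre-allocate a fixed pool of buffers once and have each worker thread borrow a buffer for the duration of a single inference, instead of giving every execution context its own permanent set. Below is a minimal sketch of that pattern in pure Python; the `BufferPool` class and `infer` function are illustrative names, and the placeholder compute stands in for the actual TensorRT host-to-device copy and `execute_async_v2` call, which are not shown here.

```python
import threading
import queue

class BufferPool:
    """Pre-allocate a fixed number of buffers once; threads borrow and return them."""
    def __init__(self, num_buffers, buffer_size):
        self._free = queue.Queue()
        for _ in range(num_buffers):
            # stands in for a pinned-host + device buffer pair
            self._free.put(bytearray(buffer_size))

    def acquire(self):
        return self._free.get()   # blocks until a buffer is free

    def release(self, buf):
        self._free.put(buf)

def infer(context_id, pool, batch, results):
    buf = pool.acquire()
    try:
        # With real TensorRT you would copy `batch` into the borrowed host buffer,
        # transfer it to the device, and run this thread's own IExecutionContext
        # on its own CUDA stream. Here a placeholder computation is used instead.
        results[context_id] = sum(batch)
    finally:
        pool.release(buf)

pool = BufferPool(num_buffers=2, buffer_size=1024)
results = {}
threads = [threading.Thread(target=infer, args=(i, pool, [i, i + 1], results))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # four results, computed by four threads sharing two buffers
```

Because the pool holds only two buffers, at most two inferences are in flight at once, which caps total buffer memory regardless of how many threads (or execution contexts) exist.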

Hi,
The links below might be useful for you:
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#thread-safety
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-priorities
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
For multi-threading/streaming, we suggest using DeepStream or Triton.
For more details, we recommend raising the query in the DeepStream or Triton forum.

Thanks!