How to run inference in multiple threads (allocating host and device buffers only once for all execution contexts)

1036758468 · July 13, 2021, 8:42am

Hi,

I have several TensorRT engines, including YOLO and Inception, for detection and classification tasks respectively. Currently I run these engines one after another, which takes a long time. I would like to run them in parallel using multithreading in Python. I know the TensorRT runtime can be used by multiple threads simultaneously, as long as each object uses a different execution context. However, host and device buffers are allocated separately for each execution context, so the total amount of allocated memory becomes quite large. Is there any way to run inference in multiple threads while allocating the buffers only once for all execution contexts? Thanks!
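
To make the idea concrete, here is a minimal sketch of the pattern I have in mind, written in plain Python with no TensorRT calls. Everything in it is an assumption for illustration: `BufferPool`, `fake_infer` and `run_parallel` are hypothetical names, a `bytearray` stands in for a host/device buffer pair, and `fake_infer` stands in for copying the input, running an execution context and reading the output. The buffers are allocated once and borrowed by whichever thread needs them. A buffer in use by one running inference cannot be used by another at the same time, so the pool size limits how many inferences really run concurrently.

```python
import queue
import threading


class BufferPool:
    """Fixed set of preallocated buffers, handed out one at a time."""

    def __init__(self, num_buffers, nbytes):
        self._free = queue.Queue()
        for _ in range(num_buffers):
            # Allocated once, up front; sized for the largest engine (assumption).
            self._free.put(bytearray(nbytes))

    def acquire(self):
        # Blocks until some thread returns a buffer.
        return self._free.get()

    def release(self, buf):
        self._free.put(buf)


def fake_infer(name, payload, buf):
    # Hypothetical stand-in for "copy input, execute context, read output".
    n = len(payload)
    buf[:n] = payload
    return name, sum(buf[:n])


def run_parallel(jobs, pool):
    results = {}
    results_lock = threading.Lock()

    def worker(name, payload):
        buf = pool.acquire()
        try:
            key, value = fake_infer(name, payload, buf)
        finally:
            # Always hand the buffer back, even if inference raised.
            pool.release(buf)
        with results_lock:
            results[key] = value

    threads = [threading.Thread(target=worker, args=job) for job in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Three jobs share two buffers, so at most two run at the same time.
pool = BufferPool(num_buffers=2, nbytes=1024)
jobs = [
    ("yolo", bytes([1, 2, 3])),
    ("inception", bytes([4, 5])),
    ("yolo_2", bytes([6])),
]
results = run_parallel(jobs, pool)
```

With a pool of size one this degenerates to running the engines one by one, so the choice of pool size is a trade-off between memory use and parallelism. Whether something like this can be done with real execution contexts is what I am asking.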

AKravets · July 13, 2021, 8:55am

Hi @1036758468.
Please note that this forum branch is dedicated to CUDA GDB support. Your question might be more suitable for different forums:

1036758468 · July 13, 2021, 8:59am

ok, thanks!