How to run inference in multiple threads (allocating host and device buffers only once for all execution contexts)


I have several TensorRT engines, including YOLO and Inception, for detection and classification tasks, respectively. Currently I run these engines one after another, which takes a long time. I would like to run them in parallel using multi-threading in Python. I know the TensorRT runtime can be used by multiple threads simultaneously, so long as each object uses a different execution context. But each execution context allocates host and device buffers for itself, so the total amount of allocated memory becomes quite large. Is there any way to run inference in multiple threads while allocating the buffers only once for all execution contexts? Thanks!
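To make the idea concrete, here is a rough sketch of the pattern being asked about: a small pool of buffers allocated once up front and borrowed by worker threads. It deliberately contains no TensorRT calls. `BufferPool`, `run_engine`, and `run_all` are hypothetical names, a `bytearray` stands in for a host/device buffer pair, and the "inference" is a placeholder sum. It assumes every buffer is sized for the largest engine's needs, and the number of buffers caps how many threads can run at the same time.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class BufferPool:
    """Fixed set of reusable buffers, allocated once up front (hypothetical helper)."""

    def __init__(self, num_buffers, nbytes):
        self.nbytes = nbytes
        self.allocations = 0
        self._free = queue.Queue()
        for _ in range(num_buffers):
            # bytearray is only a stand-in for a real host/device buffer pair.
            self._free.put(bytearray(nbytes))
            self.allocations += 1

    @contextmanager
    def borrow(self):
        buf = self._free.get()  # blocks until another thread returns a buffer
        try:
            yield buf
        finally:
            self._free.put(buf)  # always hand the buffer back, even on error


def run_engine(name, payload, pool):
    """Placeholder for one inference call; a real version would use its own execution context."""
    if len(payload) > pool.nbytes:
        raise ValueError("payload does not fit in a pooled buffer")
    with pool.borrow() as buf:
        buf[:len(payload)] = payload          # "copy input into the buffer"
        return name, sum(buf[:len(payload)])  # placeholder for execute + copy back


def run_all(jobs, pool, max_workers=4):
    """Run all (name, payload) jobs on worker threads that share one buffer pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(run_engine, name, payload, pool)
                   for name, payload in jobs]
        return dict(f.result() for f in futures)


if __name__ == "__main__":
    pool = BufferPool(num_buffers=2, nbytes=1024)
    jobs = [("yolo", b"\x01\x02\x03"), ("inception", b"\x04\x05")]
    print(run_all(jobs, pool))
```

With a pool of size 1 the threads would simply take turns, so sharing a single buffer set effectively serializes the engines. Whether a pool like this maps cleanly onto real TensorRT buffers and execution contexts is an assumption that would need checking against the TensorRT documentation.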

Hi @1036758468.
Please note that this forum branch is dedicated to CUDA-GDB support. Your question might be more suitable for different forums:

ok, thanks!