Multithreading does not improve inference performance with TensorRT models

@AastaLLL
I am using two threads, each running one TensorRT model, but the overall inference latency is about the same as running the two models serially. From what I observe, both threads do run concurrently, but each thread takes about twice as long to process its model.
Details: Multithread tensorrt does not improve inference latency · Issue #1238 · NVIDIA/TensorRT · GitHub
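To make the comparison described above concrete, here is a minimal timing sketch. It is not the code from the issue: `fake_infer` is a hypothetical stand-in that just sleeps, and in a real test each task would be a call that runs one model on its own execution context. The harness only measures wall time for the serial and the threaded case so the two numbers can be compared directly.

```python
import threading
import time
from typing import Callable, Sequence


def mean_latency_serial(tasks: Sequence[Callable[[], None]], iterations: int = 20) -> float:
    """Run all tasks back to back; return mean wall time per round (seconds)."""
    start = time.perf_counter()
    for _ in range(iterations):
        for task in tasks:
            task()
    return (time.perf_counter() - start) / iterations


def mean_latency_threaded(tasks: Sequence[Callable[[], None]], iterations: int = 20) -> float:
    """Run each task in its own thread; return mean wall time per round (seconds)."""

    def worker(task: Callable[[], None]) -> None:
        for _ in range(iterations):
            task()

    threads = [threading.Thread(target=worker, args=(task,)) for task in tasks]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return (time.perf_counter() - start) / iterations


def fake_infer(seconds: float = 0.005) -> None:
    """Hypothetical stand-in for one inference call (assumption, not TensorRT)."""
    time.sleep(seconds)


if __name__ == "__main__":
    tasks = [fake_infer, fake_infer]
    serial = mean_latency_serial(tasks)
    threaded = mean_latency_threaded(tasks)
    print(f"serial:   {serial * 1e3:.2f} ms per round")
    print(f"threaded: {threaded * 1e3:.2f} ms per round")
    print(f"speedup:  {serial / threaded:.2f}x")
```

The sleeping stand-in overlaps perfectly, so it shows the best case. If the same harness reports a speedup close to 1x with the real models, that matches the behaviour described in this post: the threads run concurrently, but each inference takes longer.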

Hi,
The links below might be useful:
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#thread-safety
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-priorities
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
For multi-threading/streaming, we suggest using DeepStream or Triton.
For more details, we recommend raising the question on the DeepStream or Triton forum.

Thanks!

Hi @hoangtm.fami,

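It depends on the model. If each model uses only a small share of the GPU's resources, multi-threading should help. If a single model already keeps the GPU busy, the two threads compete for the same hardware and each inference takes longer. Please check the GPU utilization.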
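One way to check GPU utilization from a script is sketched below. It assumes a machine where `nvidia-smi` is installed and on the PATH; on Jetson boards `tegrastats` is the usual tool instead, and this sketch does not cover it. The helper names `parse_utilization` and `sample_gpu_utilization` are made up for this example.

```python
import subprocess
from typing import List

# Ask nvidia-smi for GPU utilization only, one bare number per GPU.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]


def parse_utilization(output: str) -> List[int]:
    """Turn nvidia-smi CSV output into a list of utilization percentages.

    Lines that are not plain integers (for example "[N/A]") are skipped.
    """
    values = []
    for line in output.splitlines():
        text = line.strip()
        if text.isdigit():
            values.append(int(text))
    return values


def sample_gpu_utilization() -> List[int]:
    """Run nvidia-smi once and return utilization per GPU, in percent."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(result.stdout)


if __name__ == "__main__":
    try:
        print("GPU utilization (%):", sample_gpu_utilization())
    except (FileNotFoundError, subprocess.CalledProcessError) as error:
        # Assumption: nvidia-smi may be missing or unsupported on this device.
        print("Could not query nvidia-smi:", error)
```

Sample this while a single model is running. If the value is already close to 100, a second thread has little idle GPU capacity to use, which would explain latency that looks serial.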

Thank you.