@AastaLLL
I am using two threads, each running one TensorRT model, but the inference latency is approximately the same as running the two models serially. From my observation, both threads do run concurrently; however, the time it takes to process each thread doubles.
Details: Multithread tensorrt does not improve inference latency · Issue #1238 · NVIDIA/TensorRT · GitHub
Hi,
The links below might be useful for you:
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#thread-safety
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
For multi-threading/streaming, we suggest using DeepStream or Triton (a rough sketch of the per-thread context/stream pattern is also included at the end of this post).
For more details, we recommend raising the query on the DeepStream or Triton forum.
Thanks!
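For reference, here is a minimal sketch of the pattern those documents describe, assuming the TensorRT 8.x Python API with PyCUDA: a shared CUDA context that each worker thread pushes, one execution context and one non-default CUDA stream per thread, and `execute_async_v2` for enqueueing. The engine file names, input shape, and binding order are placeholders, not code from the linked issue.

```python
# Minimal sketch: one engine per thread, one execution context and one
# non-default CUDA stream per thread (TensorRT 8.x Python API + PyCUDA).
# Engine paths, input shapes, and binding order are placeholders.
import threading
import numpy as np
import pycuda.driver as cuda
import tensorrt as trt

cuda.init()
cuda_ctx = cuda.Device(0).make_context()      # shared CUDA context, pushed per thread

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)


def load_engine(path):
    with open(path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())


def worker(engine, input_array, results, idx):
    cuda_ctx.push()                           # make the CUDA context current in this thread
    try:
        context = engine.create_execution_context()   # one IExecutionContext per thread
        stream = cuda.Stream()                         # one non-default stream per thread

        # One host/device buffer pair per binding (assumes static shapes).
        bindings, host_bufs, dev_bufs = [], [], []
        for i in range(engine.num_bindings):
            shape = engine.get_binding_shape(i)
            dtype = trt.nptype(engine.get_binding_dtype(i))
            host = cuda.pagelocked_empty(trt.volume(shape), dtype)
            dev = cuda.mem_alloc(host.nbytes)
            host_bufs.append(host)
            dev_bufs.append(dev)
            bindings.append(int(dev))

        np.copyto(host_bufs[0], input_array.ravel())   # assumes binding 0 is the input
        cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(host_bufs[1], dev_bufs[1], stream)  # assumes binding 1 is the output
        stream.synchronize()
        results[idx] = host_bufs[1].copy()
    finally:
        cuda_ctx.pop()


# Hypothetical engine files and input shape, for illustration only.
engines = [load_engine("model_a.engine"), load_engine("model_b.engine")]
inputs = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in engines]
results = [None, None]

threads = [threading.Thread(target=worker, args=(e, x, results, i))
           for i, (e, x) in enumerate(zip(engines, inputs))]
for t in threads:
    t.start()
for t in threads:
    t.join()

cuda_ctx.pop()                                # release the context from the main thread
```

Each thread gets its own execution context because, per the thread-safety section linked above, a single execution context should not be used from multiple threads at the same time; the engine and runtime can be shared.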
Hi @hoangtm.fami,
It depends on the model. If each model uses only a small fraction of the GPU's resources, then multi-threading can reduce latency. Please check GPU utilization.
Thank you.
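One quick way to check is to run nvidia-smi in a second terminal while both threads are busy, or to sample utilization with a small pynvml (nvidia-ml-py) script like the sketch below; the GPU index and sampling interval here are arbitrary. If a single model already keeps the GPU near 100% utilization, running two in parallel will roughly double each one's latency even though the threads overlap.

```python
# Sample GPU utilization while the two inference threads are running.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # GPU 0; adjust if needed

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"GPU util: {util.gpu}%  memory util: {util.memory}%")
        time.sleep(0.5)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()
```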