[TensorRT] Speed of concurrently executing multiple TensorRT models on one GPU

Description

When I run a single Yolo3 model, each inference takes about 10 ms.

When I run 2 Yolo3 models concurrently on a 2080 GPU, in 2 threads with multiple streams and 10000 loop iterations, each inference takes about 20 ms.

A Yolo3 model uses about 2 GB of GPU memory, the 2080 has 8 GB of memory, and I am running with batch = 1.

How can I execute multiple models concurrently in multiple threads with multiple streams so that each inference still takes about 10 ms on average?

Environment

TensorRT Version: TensorRT 7 and TensorRT 5
GPU Type: 2080 (with TensorRT 7), Titan XP (with TensorRT 5)
Nvidia Driver Version: 10.0 (with TensorRT 7), 9.0 (with TensorRT 5)
CUDA Version: 10.2 (with TensorRT 7), 9.0 (with TensorRT 5)
CUDNN Version: 7.6.5 (with TensorRT 7)
Operating System + Version: Ubuntu 16.04
Python Version (if applicable): NA
TensorFlow Version (if applicable): NA
PyTorch Version (if applicable): NA
Baremetal or Container (if container which image + tag): NA

To run multiple models with TensorRT, I recommend using either NVIDIA DeepStream or NVIDIA Triton Inference Server.
Please refer to the link below for more details:

https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/index.html

If you want to use multithreading with TensorRT, please refer to the link below for best practices:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt-700/tensorrt-best-practices/index.html#thread-safety
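
As I understand the thread-safety guidance, the usual pattern is to share one engine between threads and give each thread its own execution context. The sketch below shows only that structure plus a per-thread timing loop. `StubEngine`, `StubContext` and `infer` are hypothetical stand-ins, not TensorRT classes; only the method name `create_execution_context` mirrors the TensorRT Python API. In a real application `infer` would enqueue work on that thread's own CUDA stream and synchronize it.

```python
import threading
import time


class StubContext:
    """Hypothetical stand-in for a per-thread execution context (not TensorRT)."""

    def __init__(self, engine):
        self.engine = engine
        self.calls = 0

    def infer(self):
        # Placeholder: a real context would enqueue on its own stream and wait.
        self.calls += 1


class StubEngine:
    """Hypothetical stand-in for an engine shared by all threads."""

    def create_execution_context(self):
        return StubContext(self)


def run_concurrently(engine, num_threads, iterations):
    """Run `iterations` calls in each thread; return (context, mean ms) per thread."""
    results = [None] * num_threads

    def worker(index):
        # Each thread creates and uses its own context; contexts are never shared.
        context = engine.create_execution_context()
        start = time.perf_counter()
        for _ in range(iterations):
            context.infer()
        mean_ms = 1000.0 * (time.perf_counter() - start) / iterations
        results[index] = (context, mean_ms)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


results = run_concurrently(StubEngine(), num_threads=2, iterations=1000)
for index, (context, mean_ms) in enumerate(results):
    print(f"thread {index}: {context.calls} calls, {mean_ms:.4f} ms per call")
```

The timings printed by the stub are meaningless; the point is that the mean time per call is measured inside each thread, which is the number reported as 10 ms and 20 ms in the question.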

You can also try batched inference in a single IExecutionContext. Batching might give higher throughput than multiple execution contexts.
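
When comparing these options it helps to look at aggregate throughput rather than per-call latency. Here is a minimal sketch of that arithmetic using the numbers from the question; the helper name `throughput_per_second` is made up for illustration.

```python
def throughput_per_second(latency_ms, concurrent_calls=1, batch_size=1):
    """Inferences completed per second when `concurrent_calls` calls, each
    carrying `batch_size` inputs, all finish within `latency_ms`."""
    return concurrent_calls * batch_size * 1000.0 / latency_ms


# One model in one thread: 10 ms per inference (from the question).
single = throughput_per_second(10.0)

# Two models in two threads: 20 ms per inference each (from the question).
two_threads = throughput_per_second(20.0, concurrent_calls=2)

print(f"single thread: {single:.0f} inferences/s")
print(f"two threads:   {two_threads:.0f} inferences/s")
```

Both cases come out at the same aggregate rate, which is consistent with one batch-1 inference already keeping the GPU busy, although the numbers alone do not prove it. Whether a batch of 2 in one context does better has to be measured, since the question gives no batched latency.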

Thanks