Inference speed of Triton Server

## Description

I am encountering a latency issue with the Triton server when performing inference on a model. The latency appears to depend on whether only that model is loaded on the GPU or whether multiple models of different types are loaded, even when there is sufficient free GPU memory.

## Steps To Reproduce

1. Models Overview:
   - Swin-Transformer model: loaded into the Triton server using the Python backend. The Python backend internally loads the ONNX model, with dynamic batching enabled and `max_queue_delay_microseconds` set to 10000 (a minimal backend sketch is shown after this list).
   - TensorRT Single Shot Detector model: loaded into the Triton server without dynamic batching.
2. Latency Measurement:
   - When only the `python` backend model (the Swin-Transformer) is loaded and in use, the latency is approximately 300 ms. Triton forms batches in a regular sequence such as (8, 4, 8, 4, 8, 4, …).
3. Latency Variation with Multiple Models:
   - After adding the TensorRT Single Shot Detector model to the server, the latency of the Swin-Transformer model sometimes increases to about 3 seconds (a client sketch for measuring this is included at the end of this report).
   - Triton then executes batches of seemingly random sizes, e.g., (1, 3, 8, 5, 7, 2).
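For context, below is a minimal sketch of the Python-backend `model.py` that wraps the ONNX Swin-Transformer. The model name, tensor names, and ONNX file path are placeholders for illustration and differ from the actual deployment.

```python
# model.py (Python backend) - minimal sketch; tensor names and paths are placeholders
import numpy as np
import onnxruntime as ort
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the ONNX Swin-Transformer once per model instance (path is illustrative).
        self.session = ort.InferenceSession(
            "/models/swin_transformer/1/model.onnx",
            providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
        )
        self.input_name = self.session.get_inputs()[0].name

    def execute(self, requests):
        # With dynamic batching enabled, Triton may pass several requests in a
        # single execute() call; this sketch runs each request's tensor through
        # ONNX Runtime as-is.
        responses = []
        for request in requests:
            batch = pb_utils.get_input_tensor_by_name(request, "INPUT__0").as_numpy()
            outputs = self.session.run(None, {self.input_name: batch})
            out = pb_utils.Tensor("OUTPUT__0", outputs[0].astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```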

This behavior occurs even when there is adequate free GPU memory for both models.
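For reference, the per-request latencies reported above are measured roughly as in the sketch below, using the `tritonclient` HTTP API. The model name, input/output tensor names, input shape, and server URL are placeholders.

```python
import time

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder input shaped like a Swin-Transformer image tensor.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("INPUT__0", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(data)

latencies = []
for _ in range(100):
    start = time.perf_counter()
    client.infer(model_name="swin_transformer", inputs=[inp])
    latencies.append((time.perf_counter() - start) * 1000.0)

# ~300 ms when only the Swin-Transformer is loaded; spikes toward 3 s
# after the TensorRT SSD model is also loaded.
print(f"p50={np.percentile(latencies, 50):.1f} ms  p99={np.percentile(latencies, 99):.1f} ms")
```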