Inference speed of Triton Server

## Description

I am encountering a latency issue with the Triton server when performing inference on a model. The latency appears to depend on whether only that model is loaded on the GPU or whether multiple models of different types are loaded, even when there is sufficient free GPU memory.

## Steps To Reproduce

1. Models Overview:
   - Swin-Transformer model: loaded into the Triton server using the Python backend. The Python backend internally loads the ONNX model, with dynamic batching enabled and `max_queue_delay_microseconds` set to 10000 (a minimal backend sketch is shown after this list).
   - TensorRT Single Shot Detector model: loaded into the Triton server without dynamic batching.
2. Latency Measurement:
   - When only the `python` backend model (the Swin-Transformer) is loaded and in use, the latency is approximately 300 ms. Triton forms batches in a regular sequence such as (8, 4, 8, 4, 8, 4, …).
3. Latency Variation with Multiple Models:
   - After adding the TensorRT Single Shot Detector model to the server, the latency of the Swin-Transformer model sometimes increases to about 3 seconds (a client sketch for measuring this is included at the end of this report).
   - Triton then executes batches of seemingly random sizes, e.g., (1, 3, 8, 5, 7, 2).
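For context, below is a minimal sketch of the Python-backend `model.py` that wraps the ONNX Swin-Transformer. The model name, tensor names, and ONNX file path are placeholders for illustration and differ from the actual deployment.

```python
# model.py (Python backend) - minimal sketch; tensor names and paths are placeholders
import numpy as np
import onnxruntime as ort
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the ONNX Swin-Transformer once per model instance (path is illustrative).
        self.session = ort.InferenceSession(
            "/models/swin_transformer/1/model.onnx",
            providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
        )
        self.input_name = self.session.get_inputs()[0].name

    def execute(self, requests):
        # With dynamic batching enabled, Triton may pass several requests in a
        # single execute() call; this sketch runs each request's tensor through
        # ONNX Runtime as-is.
        responses = []
        for request in requests:
            batch = pb_utils.get_input_tensor_by_name(request, "INPUT__0").as_numpy()
            outputs = self.session.run(None, {self.input_name: batch})
            out = pb_utils.Tensor("OUTPUT__0", outputs[0].astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```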

This behavior occurs even when there is adequate free GPU memory for both models.
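For reference, the per-request latencies reported above are measured roughly as in the sketch below, using the `tritonclient` HTTP API. The model name, input/output tensor names, input shape, and server URL are placeholders.

```python
import time

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder input shaped like a Swin-Transformer image tensor.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("INPUT__0", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(data)

latencies = []
for _ in range(100):
    start = time.perf_counter()
    client.infer(model_name="swin_transformer", inputs=[inp])
    latencies.append((time.perf_counter() - start) * 1000.0)

# ~300 ms when only the Swin-Transformer is loaded; spikes toward 3 s
# after the TensorRT SSD model is also loaded.
print(f"p50={np.percentile(latencies, 50):.1f} ms  p99={np.percentile(latencies, 99):.1f} ms")
```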