If the model repository contains more models than a single GPU can hold at once (e.g., because of the GPU memory limit), is there a scheduling policy that loads and unloads models dynamically? If so, what is the impact on inference latency when a request hits a model that is not currently loaded?
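For context, here is roughly what I imagine on-demand loading/unloading would look like from the client side. This is only a sketch assuming the server is started with `--model-control-mode=explicit` and uses the Python HTTP client; the model name is hypothetical, and the cost of the `load_model` step is exactly the latency I am asking about:

```python
# Sketch only: assumes `tritonclient` is installed and the server was started with
#   tritonserver --model-control-mode=explicit --model-repository=/models
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

model_name = "my_model"  # hypothetical model name

# Load the model on demand before sending requests to it.
if not client.is_model_ready(model_name):
    client.load_model(model_name)  # this load is the latency hit I am asking about

# ... run inference against model_name here ...

# Unload it again to free GPU memory for other models.
client.unload_model(model_name)
```

What I would like to know is whether Triton can make this kind of decision itself, or whether the load/unload scheduling has to be driven by an external controller like the one sketched above.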