If the model repository contains more models than a single GPU can hold at once (e.g., because of the GPU memory limit), is there a scheduling policy that loads and unloads models dynamically? If so, what is the impact on inference latency when a request hits a model that is not currently loaded?
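For context, here is roughly what I imagine on-demand loading/unloading would look like from the client side. This is only a sketch assuming the server is started with `--model-control-mode=explicit` and uses the Python HTTP client; the model name is hypothetical, and the cost of the `load_model` step is exactly the latency I am asking about:

```python
# Sketch only: assumes `tritonclient` is installed and the server was started with
#   tritonserver --model-control-mode=explicit --model-repository=/models
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

model_name = "my_model"  # hypothetical model name

# Load the model on demand before sending requests to it.
if not client.is_model_ready(model_name):
    client.load_model(model_name)  # this load is the latency hit I am asking about

# ... run inference against model_name here ...

# Unload it again to free GPU memory for other models.
client.unload_model(model_name)
```

What I would like to know is whether Triton can make this kind of decision itself, or whether the load/unload scheduling has to be driven by an external controller like the one sketched above.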