Simplifying AI Inference with NVIDIA Triton Inference Server from NVIDIA NGC

Originally published at: https://developer.nvidia.com/blog/simplifying-ai-inference-with-nvidia-triton-inference-server-from-nvidia-ngc/

Seamlessly deploying AI services at scale in production is as critical as creating the most accurate AI model. Conversational AI services, for example, need multiple models that handle automatic speech recognition (ASR), natural language understanding (NLU), and text-to-speech (TTS) to complete the application pipeline. To provide real-time conversation to users, such applications should be…

Try building your own AI application leveraging Triton Inference Server today, and let us know if you have any questions or concerns!
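For reference, a minimal sketch of sending an inference request to a running Triton server with the Python HTTP client. The server URL, model name, and tensor names/shapes here are placeholder assumptions, not values from the article; adjust them to your own model configuration.

```python
# Hypothetical example: query a locally running Triton server over HTTP.
# Assumes `pip install tritonclient[http]` and a model named "my_model" that
# takes one FP32 input "INPUT0" of shape [1, 16] and returns "OUTPUT0".
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request input from a NumPy array.
data = np.random.rand(1, 16).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Request the model's output tensor by name.
requested_output = httpclient.InferRequestedOutput("OUTPUT0")

response = client.infer(
    model_name="my_model",
    inputs=[infer_input],
    outputs=[requested_output],
)
print(response.as_numpy("OUTPUT0"))
```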

Hi, do you have any materials comparing it with TensorFlow TFX model serving?

Hi, I’m looking for an inference server to provide access to experimental models in our R&D department. Therefore, speed is not the top priority.
Do all models have to fit in the GPU memory at the same time? Or are models unloaded and reloaded if necessary?
In our scenario, many AI models (and older versions) are provided, which in total would require more GPU memory than is available, but a reload delay for a model that isn't active would be tolerable at inference time.
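One capability worth noting here (a hedged pointer, not guidance from the article excerpt above): Triton can be started with `--model-control-mode=explicit`, in which case models are loaded and unloaded on request rather than all kept resident in GPU memory. A rough sketch with the Python HTTP client follows; the model name is a placeholder.

```python
# Hypothetical sketch: on-demand load/unload using Triton's explicit
# model-control mode. Assumes the server was started with:
#   tritonserver --model-repository=/models --model-control-mode=explicit
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Load an older experimental model only when it is actually needed...
client.load_model("legacy_model_v2")
assert client.is_model_ready("legacy_model_v2")

# ...run inference against it as usual, then free its GPU memory again.
client.unload_model("legacy_model_v2")
```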