Originally published at: https://developer.nvidia.com/blog/nvidia-serves-deep-learning-inference/
You’ve built, trained, tweaked and tuned your model. You finally create a TensorRT, TensorFlow, or ONNX model that meets your requirements. Now you need an inference solution, deployable to a datacenter or to the cloud. Your solution should make optimal use of the available GPUs to get the maximum possible performance. Perhaps other requirements also…
Greetings.
I see you used a V100 for the demonstration. Concurrency: 8, 2413 infer/sec, latency 26473 usec. It gains a lot of performance when the instance_group is set to 8 (a config sketch follows below).
Have you run the same experiment on a Pascal GPU, say a P40? Would that help too?
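For reference, the instance count mentioned above is set in the model's config.pbtxt in the model repository. Below is a minimal sketch of just the relevant block; the count of 8 simply mirrors the comment, not a general recommendation.

```
# Fragment of a model's config.pbtxt (other required fields omitted).
instance_group [
  {
    # Run 8 execution instances of this model on each available GPU.
    count: 8
    kind: KIND_GPU
  }
]
```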
Would .pb files work for Caffe2 instead of the .netdef model files?
According to one of the authors:
By .pb files I think you mean TensorFlow saved-model format. TRTIS supports many different model formats including TensorFlow saved-model. You wouldn't use a .pb file with Caffe2 since it is not a Caffe2 model format, but TRTIS supports .pb (saved-model) as well as netdef. The full list of supported formats can be found in the GitHub README and linked documentation: https://github.com/NVIDIA/t...
You might also try posting your question on the NVIDIA TensorRT devtalk forum: https://devtalk.nvidia.com/...
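To make that concrete, a model repository can hold both formats side by side. The layout below is only a sketch: the model names are hypothetical, and the expected file names (model.savedmodel, model.netdef, init_model.netdef) should be checked against the documentation linked above.

```
model_repository/
  my_tf_model/              # hypothetical TensorFlow model
    config.pbtxt
    1/
      model.savedmodel/     # TensorFlow SavedModel directory
  my_caffe2_model/          # hypothetical Caffe2 model
    config.pbtxt
    1/
      model.netdef          # serialized Caffe2 predict net
      init_model.netdef     # serialized Caffe2 init (weights) net
```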
The reason I had doubts about Caffe2 + .pb files is that the model zoo provided by Caffe2 contains only .pb files. Here: https://github.com/caffe2/m...
(Dumb question since I'm new to all this.) I'm guessing that even though both Caffe2 and TF can generate .pb files, it doesn't mean the models would have the same structure, right? That is, a TensorFlow model saved as .pb is not equivalent to a Caffe2 model saved as .pb (considering the model itself to be the same).
Also, I am unable to find any tutorial on how to save a model as .netdef in Caffe2, so I'm scratching my head at this point.
Thanks, Loyd
If the model repository contains a lot of models that cannot all be accommodated by the GPU at the same time (due to the GPU memory limit), is there a scheduling policy to load/unload models dynamically? If so, what is the impact on inference latency if a request hits an unloaded model?
You will get quicker responses if you ask your questions in the GitHub project: github.com/triton-inference-server
Triton does not automatically load and unload models, but you can use the model control API to manually load and unload them. See https://github.com/triton-inference-server/server/blob/master/docs/model_management.md
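For illustration, here is a minimal sketch of explicit load/unload using Triton's Python HTTP client. It assumes the server was started with --model-control-mode=explicit, that the client package is installed (pip install tritonclient[http]), and that a model named "my_model" (hypothetical) exists in the model repository.

```python
import tritonclient.http as httpclient

# Connect to a locally running Triton server over HTTP.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Explicitly load the model from the repository into the server.
client.load_model("my_model")
print("loaded:", client.is_model_ready("my_model"))

# Unload it when it is no longer needed, freeing GPU memory.
client.unload_model("my_model")
```

With explicit model control, an inference request does not itself trigger a load, so a model generally has to be loaded before traffic is sent to it.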