Triton Server Version: 2.7.0
Triton Docker image: nvcr.io/nvidia/tritonserver:21.02-py3
GPU used: 1 Tesla T4 16GB
TensorFlow version: 2
Experiment objective: Load as many parallel instances of a TF model (.pb file) as possible and determine the point at which Triton Server fails to load any more instances.
Model used: ResNet50
Observations recorded in this document.
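For context, the instance count referred to below is controlled through the instance_group stanza of the model's config.pbtxt. The following is only a minimal sketch of such a configuration, assuming the .pb file is a frozen GraphDef and using hypothetical input/output tensor names and dims for ResNet50; the count field is what was incremented for each run.

```
name: "resnet50"
platform: "tensorflow_graphdef"
max_batch_size: 8
input [
  {
    name: "input_1"          # hypothetical tensor name
    data_type: TYPE_FP32
    dims: [ 224, 224, 3 ]
  }
]
output [
  {
    name: "predictions"      # hypothetical tensor name
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
instance_group [
  {
    count: 5                 # varied linearly during the experiment
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```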
Unexpected Behavior:
- On linearly increasing the instance count in the model's config.pbtxt file and observing the GPU memory occupied through both nvidia-smi and the Triton metrics endpoint, we see inconsistent numbers (see the sketch after this list). Ideally, the GPU memory occupied should also increase linearly, but the values remain constant for 3, 4, and 5 instances and then spike suddenly at 7 parallel instances. On increasing the instance count beyond 7, GPU memory consumption saturates at ~10GB.
- The Triton logs do not show any clear information about how many parallel instances of the same model were actually loaded (the sketch after this list shows one way to confirm the loaded instance count).
- When the instance count was set to 25, the Triton server still successfully hosted the model for inferencing, consuming ~10GB of GPU memory. Ideally it should have crashed while trying to load that many instances in parallel.
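Below is a minimal Python sketch of how the memory and instance-count numbers above could be sampled. It assumes Triton's default HTTP (8000) and metrics (8002) ports on localhost, a hypothetical model name of resnet50, and that the model-configuration extension (GET /v2/models/<name>/config) is available in this release; nv_gpu_memory_used_bytes is the gauge Triton's metrics endpoint exports for GPU memory.

```python
import requests

TRITON_HTTP = "http://localhost:8000"              # default HTTP endpoint (assumed)
TRITON_METRICS = "http://localhost:8002/metrics"   # default metrics endpoint (assumed)
MODEL = "resnet50"                                 # hypothetical model name

def gpu_memory_used_bytes():
    """Parse the nv_gpu_memory_used_bytes gauge from Triton's Prometheus metrics."""
    text = requests.get(TRITON_METRICS, timeout=5).text
    for line in text.splitlines():
        if line.startswith("nv_gpu_memory_used_bytes"):
            return float(line.rsplit(" ", 1)[-1])
    return None

def loaded_instance_groups(model):
    """Fetch the configuration Triton actually loaded; instance_group holds the instance count."""
    url = f"{TRITON_HTTP}/v2/models/{model}/config"
    cfg = requests.get(url, timeout=5).json()
    return cfg.get("instance_group", [])           # field names assumed to mirror config.pbtxt

if __name__ == "__main__":
    used = gpu_memory_used_bytes()
    print("GPU memory used (GB):", round(used / 1e9, 2) if used else "n/a")
    for group in loaded_instance_groups(MODEL):
        print("instance_group count:", group.get("count"), "kind:", group.get("kind"))
```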
Desired Behavior: We ran the same experiment with a TensorRT model (PeopleNet, .plan file): GPU memory increased linearly as the instance count was incremented, and the Triton Server crashed when we tried to load 10 parallel instances at once.