Inconsistent GPU memory utilization with parallel model instances

Triton Server Version: 2.7.0

Triton Docker image: nvcr.io/nvidia/tritonserver:21.02-py3

GPU used: 1 Tesla T4 16GB

TensorFlow version: 2

Experiment objective: Load as many parallel instances of a TF model (.pb file) as possible and determine the point at which Triton Server fails to load any more instances.

Model used: ResNet50

Observations recorded in this document.
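
For reference, the instance count was varied through the model's config.pbtxt. The snippet below is a minimal sketch of that kind of configuration; the model name, platform, and batch size are illustrative assumptions, not the exact file used in the experiment:

```
name: "resnet50"
platform: "tensorflow_graphdef"
max_batch_size: 8
# Input/output tensor definitions omitted for brevity.
instance_group [
  {
    count: 7        # varied from 1 upwards across runs
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```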

Unexpected Behavior:

  1. When the instance count in the model's config.pbtxt is increased linearly and GPU memory is observed through both nvidia-smi and the Triton metrics endpoint (see the sketch after this list), the numbers are inconsistent. Ideally, the GPU memory occupied should also increase linearly, but it stays constant for 3, 4, and 5 instances and then spikes suddenly at 7 parallel instances. Increasing the instance count beyond 7, GPU memory consumption saturates at ~10 GB.

  2. The Triton logs do not show any clear information about how many parallel instances of the same model were actually loaded.

  3. When the instance count was set to 25, the Triton server still hosted the model successfully for inference, consuming ~10 GB of GPU memory. Ideally, it should have failed while trying to load that many instances in parallel.
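
GPU memory was sampled from Triton's Prometheus metrics endpoint (default port 8002) alongside nvidia-smi. Below is a minimal Python sketch of that sampling; the localhost URL and the `nv_gpu_memory_used_bytes` metric name follow Triton's default GPU metrics, and the script itself is illustrative rather than the exact tooling used in the experiment:

```python
import time
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # default Triton metrics endpoint


def gpu_memory_used_bytes() -> float:
    """Scrape Triton's Prometheus metrics and return the first
    nv_gpu_memory_used_bytes sample (bytes used on the GPU)."""
    text = urllib.request.urlopen(METRICS_URL).read().decode("utf-8")
    for line in text.splitlines():
        # Metric lines look like: nv_gpu_memory_used_bytes{gpu_uuid="..."} 1.234e+09
        if line.startswith("nv_gpu_memory_used_bytes"):
            return float(line.rsplit(" ", 1)[-1])
    raise RuntimeError("nv_gpu_memory_used_bytes not found in metrics output")


if __name__ == "__main__":
    # Poll once per second while the instance count is changed and the model reloaded.
    while True:
        print(f"GPU memory used: {gpu_memory_used_bytes() / 2**30:.2f} GiB")
        time.sleep(1.0)
```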

Desired Behavior: We ran the same experiment with a TensorRT model (PeopleNet, .plan file), and GPU memory increased linearly as the instance count was incremented. Also, the Triton Server crashed when we tried to load 10 parallel instances at once.