Description
We are observing spikes in RAM usage (~40 GB) while using Triton Inference Server. Our pipeline contains Python backend models (CPU and GPU) and TensorRT models, and we also use BLS (Business Logic Scripting); a rough sketch of the BLS pattern is shown below.
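For reference, the BLS path in our pipeline looks roughly like the minimal sketch below. The model name `trt_model` and the tensor names `INPUT0`/`OUTPUT0` are placeholders for illustration, not our actual configuration:

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Forward the incoming tensor to a TensorRT model via BLS.
            # "INPUT0", "OUTPUT0", and "trt_model" are placeholder names.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            infer_request = pb_utils.InferenceRequest(
                model_name="trt_model",
                requested_output_names=["OUTPUT0"],
                inputs=[input_tensor],
            )
            infer_response = infer_request.exec()
            if infer_response.has_error():
                raise pb_utils.TritonModelException(
                    infer_response.error().message())
            output = pb_utils.get_output_tensor_by_name(
                infer_response, "OUTPUT0")
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[output]))
        return responses
```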
Environment
Triton version: 22.12 (nvcr.io/nvidia/tritonserver:22.12-py3)
Python Backend version: r21.08
TensorRT Version: 8.5.1
GPU Type: GeForce RTX 2080 SUPER
NVIDIA Driver Version: 510.108.03
CUDA Version: 11.8
CUDNN Version: 8.7.0 GA
Operating System + Version: Ubuntu 20.04
Python Version (if applicable): 3.7
Relevant Files
Steps To Reproduce
We could not pinpoint the exact cause of the spikes, so we cannot specify exact steps to reproduce. The issue occurs mostly on long runs (8-12 hours) of Triton Server.
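Since the spikes only appear on long runs, we have been considering running a watcher like the sketch below alongside the server to timestamp when the resident memory of the tritonserver process grows. This is just an assumption on how one might catch the spike (it uses the third-party psutil package and is not part of our deployment):

```python
import time

import psutil


def log_tritonserver_rss(interval_s: int = 60) -> None:
    """Log the RSS of every tritonserver process once per interval,
    so a RAM spike during an 8-12 hr run can be timestamped."""
    while True:
        for proc in psutil.process_iter(["name", "memory_info"]):
            name = proc.info["name"]
            mem = proc.info["memory_info"]
            if not name or "tritonserver" not in name or mem is None:
                continue
            rss_gb = mem.rss / 1024 ** 3
            print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} "
                  f"pid={proc.pid} rss={rss_gb:.2f} GiB", flush=True)
        time.sleep(interval_s)


if __name__ == "__main__":
    log_tritonserver_rss()
```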