Triton server memory accumulation problem

Description

GPU memory used by Triton Inference Server does not return to its pre-request level after inference requests from DeepStream 6.1 complete. Each round of requests leaves a little more memory allocated (for example, 5520 MiB at startup vs. 5680 MiB after the first round), so usage gradually accumulates.

Environment

TensorRT Version: 8.6.1.6
GPU Type: NVIDIA GeForce RTX 2080
Nvidia Driver Version: 525.105.17
CUDA Version: 12.0
CUDNN Version:
Operating System + Version: Ubuntu 22.04
Python Version (if applicable): 3.8
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Container (nvcr.io/nvidia/tritonserver:23.10-py3)

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)
Immediately after starting Triton:
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 35%   30C    P2    55W / 260W |   5520MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

While the Triton server is receiving and processing a request:
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 35%   31C    P2    65W / 260W |   8176MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

After the Triton request has completed:
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 35%   31C    P2    55W / 260W |   5680MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

I am using Triton 23.10.
Right after starting tritonserver, nvidia-smi reports about 5520 MiB of GPU memory in use.
While DeepStream 6.1 is sending inference requests to tritonserver, usage rises to about 8176 MiB.
However, once the requests have completely finished, nvidia-smi shows 5680 MiB rather than the original 5520 MiB, and repeating the requests and checking nvidia-smi each time shows the leftover growing further.
In other words, wasted memory gradually accumulates with every round of requests.
I would appreciate any advice on how to resolve this.
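
For reference, here is a minimal watcher sketch that logs used GPU memory over time so the per-request leftover can be quantified. It assumes the nvidia-ml-py (pynvml) package and GPU index 1 as in the nvidia-smi output above (adjust the index to the actual device):

import time
import pynvml

pynvml.nvmlInit()
# GPU index 1 matches the nvidia-smi output above; change if needed.
handle = pynvml.nvmlDeviceGetHandleByIndex(1)

baseline = None
try:
    while True:
        used_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / (1024 ** 2)
        if baseline is None:
            baseline = used_mib
        # Report current usage and the drift from the first sample.
        print(f"used: {used_mib:8.0f} MiB   drift: {used_mib - baseline:+8.0f} MiB")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()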

Triton was started with:

docker run --gpus device=3 -d --name st_model_convert_always --restart=always \
  --net=host \
  -v /home/users/asd/model:/models \
  nvcr.io/nvidia/tritonserver:23.10-py3 tritonserver --model-repository=/models
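
As a diagnostic experiment only (not a confirmed fix), the same command can be run with Triton's CUDA and pinned memory pools capped explicitly. The byte sizes below are illustrative assumptions, and whether this changes the accumulation depends on where the growth actually comes from; inside the container the single exposed GPU is device 0, hence the 0: prefix:

docker run --gpus device=3 -d --name st_model_convert_always --restart=always \
  --net=host \
  -v /home/users/asd/model:/models \
  nvcr.io/nvidia/tritonserver:23.10-py3 tritonserver --model-repository=/models \
    --cuda-memory-pool-byte-size=0:268435456 \
    --pinned-memory-pool-byte-size=268435456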

The inference requests were sent from DeepStream 6.1.
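
For completeness, here is a standalone client sketch that reproduces the request loop outside DeepStream. The model name my_model, the input name INPUT__0, and the input shape are placeholders for the actual model configuration; it assumes the tritonclient[http] package and the default HTTP port 8000 (reachable via --net=host):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

for i in range(100):
    # Placeholder input: adjust name, shape, and dtype to the deployed model.
    data = np.random.rand(1, 3, 224, 224).astype(np.float32)
    inp = httpclient.InferInput("INPUT__0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    client.infer(model_name="my_model", inputs=[inp])
    print(f"request {i + 1} done")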

Hi @dbdnjswns2,
We request you to raise this on the Triton Inference Server issue tracker: Issues · triton-inference-server/server · GitHub

Thanks