Using TensorRT models with NanoLLM causes memory leaks

Description

I ran NanoLLM's nano_llm.vision.video example using the Docker image dustynv/nano_llm:r36.3.0.
After running the application for 5 hours and monitoring its RSS, I found that memory usage grows steadily, i.e. there is a memory leak.
Could you tell me how to fix it?

Following the issue in the link below, which describes a known memory leak problem, I have already added a gc.collect() call.

Additionally, since I do not need the video output, I removed the code related to the video_output variable.
See video.py under Relevant Files for the changes to /opt/NanoLLM/nano_llm/vision/video.py.
When not using the TensorRT model, there was hardly any memory growth.
The leak therefore appears to be tied to using the TensorRT model.

Environment

TensorRT Version: 8.6.2
GPU Type: Jetson Orin
CUDA Version: 12.2
CUDNN Version: 8.9.4
Operating System + Version: Ubuntu 22.04
Python Version (if applicable): 3.10.12
PyTorch Version (if applicable): 2.2.0
Baremetal or Container (if container which image + tag): Container dustynv/nano_llm:r36.3.0

Relevant Files

The model used was Efficient-Large-Model/VILA1.5-3b.
It is automatically downloaded at runtime.

The following is the script used to run the VLM.
run.sh.txt (284 Bytes)
(Due to restrictions on uploadable file extensions, “.txt” has been added to the end of the file name.)
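The attachment is not reproduced inline; as a rough orientation, a minimal run.sh might look like the following sketch. The model name is taken from this post, but the exact command-line flags are an assumption based on NanoLLM's vision/video example and may differ from the attached script.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of run.sh -- the actual script is attached above.
# Flags are assumed from NanoLLM's vision/video example; adjust as needed.
python3 -m nano_llm.vision.video \
    --model Efficient-Large-Model/VILA1.5-3b \
    --video-input /dev/video0 \
    --prompt "Describe the image." \
    --max-images 8
```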

I made the following changes to /opt/NanoLLM/nano_llm/vision/video.py.
video.py.txt (3.7 KB)
(Due to restrictions on uploadable file extensions, “.txt” has been added to the end of the file name.)

diff --git a/nano_llm/vision/video.py b/nano_llm/vision/video.py
index aa32878..336d304 100644
--- a/nano_llm/vision/video.py
+++ b/nano_llm/vision/video.py
@@ -24,6 +24,7 @@ from nano_llm.plugins import VideoSource, VideoOutput

 from termcolor import cprint
 from jetson_utils import cudaMemcpy, cudaToNumpy, cudaFont
+import gc

 # parse args and set some defaults
 parser = ArgParser(extras=ArgParser.Defaults + ['prompt', 'video_input', 'video_output'])
@@ -72,15 +73,11 @@ def on_video(image):
     if last_text:
         font_text = remove_special_tokens(last_text)
         wrap_text(font, image, text=font_text, x=5, y=5, color=(120,215,21), background=font.Gray50)
-    video_output(image)

 video_source = VideoSource(**vars(args), cuda_stream=0)
 video_source.add(on_video, threaded=False)
 video_source.start()

-video_output = VideoOutput(**vars(args))
-video_output.start()
-
 font = cudaFont()

 # apply the prompts to each frame
@@ -123,8 +120,8 @@ while True:

     if num_images >= args.max_images:
         chat_history.reset()
+        gc.collect()
         num_images = 0

     if video_source.eos:
-        video_output.stream.Close()
         break
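Since gc.collect() only reclaims Python-managed objects, one way to narrow the leak down (a sketch, not part of the attached code) is to track the Python heap with tracemalloc alongside the RSS log: if RSS keeps rising while the tracemalloc total stays flat, the leak is in native allocations (e.g. the TensorRT/CUDA path) rather than in Python objects.

```python
import gc
import tracemalloc

tracemalloc.start()

def python_heap_bytes(tag: str) -> int:
    """Force a GC pass, then return the bytes currently tracked by tracemalloc."""
    gc.collect()
    current, _peak = tracemalloc.get_traced_memory()
    print(f"[{tag}] python heap: {current / 1024:.1f} KiB")
    return current

# In video.py this could be called in the per-frame loop, e.g. right after
# chat_history.reset().  Compare its trend against the RSS trend from ps:
# a flat Python heap with rising RSS points at native (TensorRT/CUDA) memory.
baseline = python_heap_bytes("baseline")
junk = [bytes(4096) for _ in range(1000)]  # simulate Python-side growth
grown = python_heap_bytes("after alloc")
del junk
freed = python_heap_bytes("after free")
```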

The following file is the script that records the RSS log.
It samples RSS every 5 seconds using the ps -aux command.
ps_5sec.bash.txt (197 Bytes)
(Due to restrictions on uploadable file extensions, “.txt” has been added to the end of the file name.)
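The attachment is not reproduced inline; a hypothetical reconstruction of such a sampling script (the actual file is attached above) could look like this:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of ps_5sec.bash -- the actual script is attached above.
# Appends one "timestamp RSS-in-KiB" line per matching process to the log.

sample_rss() {
    # $1: pattern matching the command line of the process to watch.
    # RSS is column 6 of `ps aux`; skip the awk helper itself, whose own
    # command line would otherwise match the pattern.
    ps aux | awk -v ts="$(date '+%Y-%m-%d %H:%M:%S')" -v pat="$1" \
        '$0 ~ pat && $11 !~ /awk/ {print ts, $6}'
}

# Usage: ./ps_5sec.bash psinfo.log
if [ -n "$1" ]; then
    while true; do
        sample_rss "nano_llm.vision.video" >> "$1"
        sleep 5
    done
fi
```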

The following is the RSS log from running the VLM with the TensorRT model.
psinfo.log (1.2 MB)

The following is the RSS log from running the VLM without the TensorRT model.
psinfo.log (1.2 MB)
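To compare the two runs quantitatively, each RSS log can be reduced to a single growth rate by fitting a line to RSS over time. The sketch below is not from the attached scripts; it assumes (time-in-seconds, RSS-in-KiB) pairs have already been extracted from the ps output:

```python
def rss_growth_rate(samples):
    """Least-squares slope of RSS (KiB) over time (s): KiB leaked per second."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

# Synthetic example: RSS growing 2 KiB/s, sampled every 5 s like ps_5sec.bash.
samples = [(5.0 * i, 100000.0 + 10.0 * i) for i in range(100)]
rate = rss_growth_rate(samples)  # ~2.0 KiB/s
```

A rate near zero for the non-TensorRT log versus a clearly positive rate for the TensorRT log would back up the conclusion in the Description.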

Steps To Reproduce

When using a TensorRT model

Launch a Docker container

sudo docker run -itd --runtime=nvidia --device=/dev/video0:/dev/video0 -v ${PWD}:${PWD} -w ${PWD} -e PYTHONPATH=/opt/clip_trt:/opt/NanoLLM:/opt/NanoDB:/opt/faiss_lite dustynv/nano_llm:r36.3.0
sudo docker ps
sudo docker exec -it <container name> bash

Download the models and save the TensorRT models

mkdir -p -m 777 /data/models/mlc/dist/models
mkdir -p -m 777 /data/models/clip
cp video.py  /opt/NanoLLM/nano_llm/vision/video.py
bash ./run.sh
# Wait until inference begins.
# Exit the program with Ctrl + c.

Restart Jetson

exit
sudo docker stop <container name>
sudo reboot

Launch a Docker container

sudo docker start <container name>
sudo docker exec -it <container name> bash

Run the VLM program

nohup bash ./run.sh &

Run the script that records the memory log

exit
nohup bash ./ps_5sec.bash psinfo.log &

When not using a TensorRT model

Launch a Docker container

sudo docker run -itd --runtime=nvidia --device=/dev/video0:/dev/video0 -v ${PWD}:${PWD} -w ${PWD} -e PYTHONPATH=/opt/clip_trt:/opt/NanoLLM:/opt/NanoDB:/opt/faiss_lite dustynv/nano_llm:r36.3.0
sudo docker ps
sudo docker exec -it <container name> bash

Download the models
(Do not create the /data/models/clip directory.)

mkdir -p -m 777 /data/models/mlc/dist/models
cp video.py  /opt/NanoLLM/nano_llm/vision/video.py
bash ./run.sh
# Wait until inference begins.
# Exit the program with Ctrl + c.

Restart Jetson

exit
sudo docker stop <container name>
sudo reboot

Launch a Docker container

sudo docker start <container name>
sudo docker exec -it <container name> bash

Run the VLM program

nohup bash ./run.sh &

Run the script that records the memory log

exit
nohup bash ./ps_5sec.bash psinfo.log &