Memory Leak of deepstream-test3 (using grpc, triton-server)

The Python sample apps provided by NVIDIA include deepstream-test3, which can run inference through triton-server, so I ran it to verify its operation and to check for memory leaks.

[Overview]
• Hardware Platform (Jetson / GPU) = Jetson Orin NX 16G
• DeepStream Version = DS 7.0
• JetPack Version (valid for Jetson only) = JetPack 6.0 GA (L4T 36.3)
• Issue Type( questions, new requirements, bugs) = Bug
• How to reproduce the issue?

  • Sample app: deepstream_python_apps/apps/deepstream-test3 (NVIDIA-AI-IOT/deepstream_python_apps on GitHub, master branch)

  • System configuration:

    • (1) Docker container running the deepstream-test3 sample app with mp4 file as test input
    • (2) Triton Server (using grpc) running YOLO detector
    • (1) and (2) are running on the same Orin NX 16G host.
  • State while running app: kmalloc-128 increases continuously and host memory usage increases

  • Test Duration: 30min

  • Average FPS: 130

  • Host kmalloc-128 Amount, Size:

    • Start: [amount: 10015, size: 40,060K]
    • End: [amount: 54910, size: 219,664K]
    • Diff: [amount: +44895, size: +179,604K]

What should we do to solve this problem?
I would appreciate it if you could check it out.

Thank you.

Can you share the command you use to check the memory? You can also use valgrind to check for memory leaks.

Thank you for your confirmation.
Here is the script I use to investigate the slab cache:

#!/bin/bash
# Log the kmalloc-128 slab cache statistics to a CSV file every 10 seconds.
start_time=$(date +"%Y%m%d_%H%M%S")
output_file="/path/to/slabtop-kmalloc-128-$start_time.csv"
echo "Timestamp,OBJS,ACTIVE,USE,OBJ_SIZE,SLABS,OBJSLAB,CACHE_SIZE,NAME" > "$output_file"

while true; do
  timestamp=$(date +"%Y-%m-%d %H:%M:%S")
  # slabtop -o prints one snapshot and exits; skip the dma-kmalloc-128 cache.
  slabtop -o | grep "kmalloc-128" | grep -v "dma-kmalloc-128" | awk -v ts="$timestamp" '{
    print ts "," $1 "," $2 "," $3 "," $4 "," $5 "," $6 "," $7 "," $8
  }' >> "$output_file"
  sleep 10
done
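
For anyone who prefers to sample the same counters without slabtop, the following is a minimal Python sketch that parses a /proc/slabinfo style line directly. It assumes the standard slabinfo 2.x column layout, a 4 KiB page size, and read access to /proc/slabinfo (usually root only); the function names are hypothetical and not part of any NVIDIA tool. Matching the cache name exactly also removes the need to filter out dma-kmalloc-128.

```python
PAGE_KIB = 4  # assumption: 4 KiB kernel pages


def parse_slabinfo(text, cache_name="kmalloc-128"):
    """Return counters for one slab cache from /proc/slabinfo text, or None."""
    for line in text.splitlines():
        fields = line.split()
        # Exact name match, so dma-kmalloc-128 is never picked up.
        if not fields or fields[0] != cache_name:
            continue
        active_objs, num_objs, obj_size, objs_per_slab, pages_per_slab = map(
            int, fields[1:6]
        )
        # "slabdata" is followed by <active_slabs> <num_slabs> <sharedavail>.
        num_slabs = int(fields[fields.index("slabdata") + 2])
        return {
            "objs": num_objs,
            "active": active_objs,
            "obj_size": obj_size,
            "slabs": num_slabs,
            "objs_per_slab": objs_per_slab,
            "cache_kib": num_slabs * pages_per_slab * PAGE_KIB,
        }
    return None


def read_kmalloc_128(path="/proc/slabinfo"):
    """Read the live counters; needs permission to open the file."""
    with open(path) as f:
        return parse_slabinfo(f.read())
```

Calling read_kmalloc_128() in a loop with a 10 second sleep would give roughly the same time series as the shell script above.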

OK. Can you make the following comparison to narrow this down?
Run deepstream-test3 without Triton (just use config_infer_primary_peoplenet.txt as the config file).

This problem does not occur unless nvinferserver is used, so I believe it is caused by the interaction between deepstream-test3 and triton-server.

I have tried that on my Orin board with DS7.0.
nvinfer:

python3 deepstream_test_3.py -i file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.mp4 --pgie nvinfer -c config_infer_primary_peoplenet.txt --no-display --silent --file-loop

nvinferserver:

python3 deepstream_test_3.py -i file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.mp4 --pgie nvinferserver-grpc -c config_triton_grpc_infer_primary_peoplenet.txt --no-display --silent --file-loop

slabtop-kmalloc-128-20240702_165210_nvinfer.csv (146.5 KB)
slabtop-kmalloc-128-20240702_172558_nvinferserver.csv (157.7 KB)
Both grow continuously. The CACHE_SIZE grows faster with nvinferserver than with nvinfer.
nvinfer: 523748K->566340K
nvinferserver: 570440K->663576K
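
To make the comparison concrete, here is a small Python sketch of the arithmetic on the four CACHE_SIZE values above. The run durations are not stated in this post, so the ratio is only meaningful under the assumption that both runs lasted about the same time.

```python
# CACHE_SIZE values in KiB (start, end), copied from the two runs above.
samples_kib = {
    "nvinfer": (523748, 566340),
    "nvinferserver": (570440, 663576),
}

# Absolute growth per mode.
growth = {mode: end - start for mode, (start, end) in samples_kib.items()}

# How much faster nvinferserver grew, assuming comparable run durations.
ratio = growth["nvinferserver"] / growth["nvinfer"]

for mode, delta in growth.items():
    print(f"{mode}: +{delta}K")
print(f"nvinferserver / nvinfer growth ratio: {ratio:.2f}")
```

Under that assumption, nvinferserver grew by 93136K against 42592K for nvinfer, a little over twice as much.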

We will run that for 24 hours to check the memory and analyze this problem. Thanks

Hi @yosuke.hara, could you try the following steps to narrow down this issue?

  1. Record the kmalloc-128 memory growth in nvinfer and nvinferserver modes separately
  2. Modify the code and record the kmalloc-128 memory growth in nvinfer and nvinferserver modes separately again
    queue1.link(queue2)
    #pgie.link(queue2)

Hello,

We will run that for 24 hours to check the memory and analyze this problem.

First, I would like to know your company’s test results.

  1. Record the kmalloc-128 memory growth in nvinfer and nvinferserver modes separately
  2. Modify the code and record the kmalloc-128 memory growth in nvinfer and nvinferserver modes separately again

Also, since I belong to the QA team, I would like to let you know in advance that I only have an environment that can run nvinferserver. Because of this, I can only provide test results for step 2 (modify the code and record the kmalloc-128 memory growth) in nvinferserver mode.

Thank you.

As per your request, I modified the code of deepstream_test3 as below and executed it.

@@ -369,8 +376,9 @@ def main(args, requested_pgie=None, config=None, disable_probe=False):
 
     print("Linking elements in the Pipeline \n")
     streammux.link(queue1)
-    queue1.link(pgie)
-    pgie.link(queue2)
+    queue1.link(queue2)
+    # queue1.link(pgie)
+    # pgie.link(queue2)
     if nvdslogger:
         queue2.link(nvdslogger)
         nvdslogger.link(tiler)

The kmalloc-128 metrics over 10 minutes are as follows:

Timestamp,OBJS,ACTIVE,USE,OBJ_SIZE,SLABS,OBJSLAB,CACHE_SIZE,NAME
2024-07-11 08:29:05,288192,285047,98%,0.12K,9006,32,36024K,kmalloc-128
2024-07-11 08:29:15,288192,285310,98%,0.12K,9006,32,36024K,kmalloc-128
2024-07-11 08:29:25,288192,287049,99%,0.12K,9006,32,36024K,kmalloc-128
2024-07-11 08:29:35,288192,287382,99%,0.12K,9006,32,36024K,kmalloc-128
2024-07-11 08:29:45,289952,289861,99%,0.12K,9061,32,36244K,kmalloc-128
2024-07-11 08:29:55,289888,287870,99%,0.12K,9059,32,36236K,kmalloc-128
2024-07-11 08:30:05,289888,287989,99%,0.12K,9059,32,36236K,kmalloc-128
2024-07-11 08:30:15,289888,288276,99%,0.12K,9059,32,36236K,kmalloc-128
2024-07-11 08:30:25,289888,288817,99%,0.12K,9059,32,36236K,kmalloc-128
2024-07-11 08:30:35,289888,289155,99%,0.12K,9059,32,36236K,kmalloc-128
2024-07-11 08:30:45,289888,289424,99%,0.12K,9059,32,36236K,kmalloc-128
2024-07-11 08:30:55,289888,288331,99%,0.12K,9059,32,36236K,kmalloc-128
2024-07-11 08:31:05,289888,288423,99%,0.12K,9059,32,36236K,kmalloc-128
2024-07-11 08:31:15,289888,288569,99%,0.12K,9059,32,36236K,kmalloc-128
2024-07-11 08:31:25,289888,288622,99%,0.12K,9059,32,36236K,kmalloc-128
2024-07-11 08:31:35,289888,288691,99%,0.12K,9059,32,36236K,kmalloc-128
2024-07-11 08:31:45,289888,288690,99%,0.12K,9059,32,36236K,kmalloc-128
2024-07-11 08:31:55,289728,286708,98%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:32:05,289728,286632,98%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:32:15,289728,286932,99%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:32:25,289728,286731,98%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:32:35,289728,286887,99%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:32:45,289728,287166,99%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:32:55,289728,286665,98%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:33:05,289728,286735,98%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:33:15,289728,286852,99%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:33:25,289728,286956,99%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:33:35,289728,287126,99%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:33:45,289728,287139,99%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:33:55,289728,286658,98%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:34:05,289728,286857,99%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:34:15,289728,287037,99%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:34:25,289728,287071,99%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:34:35,289728,286964,99%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:34:45,289728,287286,99%,0.12K,9054,32,36216K,kmalloc-128
2024-07-11 08:34:55,289696,286816,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:35:05,289696,287064,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:35:15,289696,287065,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:35:25,289696,287285,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:35:35,289696,287281,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:35:45,289696,287254,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:35:55,289696,287369,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:36:05,289696,286773,98%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:36:15,289696,286659,98%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:36:25,289696,286886,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:36:35,289696,286754,98%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:36:45,289696,286765,98%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:36:55,289696,286868,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:37:05,289696,286822,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:37:15,289696,286805,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:37:25,289696,286810,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:37:35,289696,286887,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:37:45,289696,286791,98%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:37:55,289696,286866,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:38:05,289696,286723,98%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:38:16,289696,286723,98%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:38:26,289696,286748,98%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:38:36,289696,286737,98%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:38:46,289696,287061,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:38:56,289696,286937,99%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:39:06,289696,286770,98%,0.12K,9053,32,36212K,kmalloc-128
2024-07-11 08:39:16,289696,286851,99%,0.12K,9053,32,36212K,kmalloc-128
  • Test Duration: 10 min
  • Start CACHE_SIZE: 36024K
  • End CACHE_SIZE: 36212K
  • Diff: 188K
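
The Start, End, and Diff figures can be reproduced from the CSV log with a short Python sketch like the one below. The function name is hypothetical, and the sample embeds only three rows from the log above (first, peak, last) to keep it short; in practice the whole file would be read.

```python
import csv
import io


def summarize_cache_size(csv_text):
    """Summarize the CACHE_SIZE column (values such as '36024K') in KiB."""
    rows = csv.DictReader(io.StringIO(csv_text))
    sizes = [int(row["CACHE_SIZE"].rstrip("K")) for row in rows]
    return {
        "start_kib": sizes[0],
        "end_kib": sizes[-1],
        "diff_kib": sizes[-1] - sizes[0],
        "peak_kib": max(sizes),
    }


# Three rows taken from the log above: first sample, peak, last sample.
sample_csv = """Timestamp,OBJS,ACTIVE,USE,OBJ_SIZE,SLABS,OBJSLAB,CACHE_SIZE,NAME
2024-07-11 08:29:05,288192,285047,98%,0.12K,9006,32,36024K,kmalloc-128
2024-07-11 08:29:45,289952,289861,99%,0.12K,9061,32,36244K,kmalloc-128
2024-07-11 08:39:16,289696,286851,99%,0.12K,9053,32,36212K,kmalloc-128
"""

summary = summarize_cache_size(sample_csv)
print(summary)
```

Reporting the peak alongside the difference helps separate a steady leak from a one-time allocation early in the run: here the cache peaks at 36244K shortly after start and then settles slightly lower.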

As you can see, the kmalloc-128 cache size has hardly increased. To summarize the results so far, I believe the most likely suspects are nvinferserver and tritonserver.

I would appreciate it if you could continue your investigation.
Thank you.

OK. Thanks. I have tried both nvinfer and nvinferserver on my side. The kmalloc-128 cache size hardly increases after removing pgie from the pipeline in both scenarios. We are continuing our investigation.