TX2NX Swap Filling Up Due To Possible Memory Leak

Issue:

We are currently running a two-stage DeepStream pipeline with a PGIE (object detection) and an SGIE (object detection). After the application has been running for a while, the 2 GB swap space on the TX2NX fills up, and so does the RAM. This causes the application to slow down and, in some cases, stops downstream tasks such as relay actuation.
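
For context, the pipeline is roughly equivalent to the sketch below (a simplified single-source version with placeholder file and config paths, not our actual application code, which is built programmatically):

```python
# Simplified sketch of the two-stage pipeline (PGIE -> SGIE); paths are placeholders.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

pipeline = Gst.parse_launch(
    "nvstreammux name=mux batch-size=1 width=1280 height=720 ! "
    "nvinfer config-file-path=pgie_config.txt ! "   # primary detector (PGIE)
    "nvinfer config-file-path=sgie_config.txt ! "   # secondary detector (SGIE)
    "nvvideoconvert ! nvdsosd ! fakesink "
    "filesrc location=sample.h264 ! h264parse ! nvv4l2decoder ! mux.sink_0"
)

pipeline.set_state(Gst.State.PLAYING)
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE, Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```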

The application architecture is slightly complex but the following are the Docker containers that run at the same time:

  • Django application for APIs
  • Kafka container for message passing
  • DeepStream based application
  • NGINX

Right after starting the application, jtop looks like the following:

After several hours of running (usually 4 - 5), jtop looks like the following:

As you can see, memory usage has gone up and so has swap usage.

What we have tried

Profiling Tools
We have tried profiling the application with several tools (valgrind, cuda-memcheck, heaptrack) and have not found any significant leaks:

  • Valgrind: detects 800+ MB of leaks in libcuda.so. However, according to our research (refer link), Valgrind is known to report false positives with CUDA, so we did not take this very seriously.
  • cuda-memcheck: no leaks/errors.
  • heaptrack: no major memory leaks (a ~6 MB leak detected in GStreamer, which should be fine).

Restarting containers
We have also tried restarting individual containers to narrow down the issue, since restarting a container frees a certain amount of swap. The exact amount freed varies from run to run, roughly in the following ranges (a small measurement sketch follows the list):

  • Restarting Kafka: frees 600 MB - 1.1 GB of Swap
  • Restarting DeepStream: frees 500 - 1.5 GB of Swap
  • Restarting the API and NGINX containers has almost no effect.
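
For reference, this is roughly how we measure the swap freed by a restart (the container name is a placeholder; swap figures are read from /proc/meminfo and docker is called via subprocess):

```python
# Rough sketch: quantify how much swap a container restart frees.
import subprocess
import time

def swap_used_kb():
    """Return currently used swap in kB, parsed from /proc/meminfo."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.strip().split()[0])  # values are reported in kB
    return fields["SwapTotal"] - fields["SwapFree"]

before = swap_used_kb()
subprocess.run(["docker", "restart", "kafka"], check=True)  # placeholder container name
time.sleep(30)  # let the restarted container settle before sampling again
after = swap_used_kb()
print("Swap freed by restart: {:.1f} MB".format((before - after) / 1024.0))
```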

Swappiness
We have also tried things like lowering the swappiness via vm.swappiness in /etc/sysctl.conf, but it did not change anything.
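
For completeness, we verify after `sysctl -p` that the value actually took effect, since a silently ignored setting would also look like "no change":

```python
# Quick runtime check that the vm.swappiness change was applied.
with open("/proc/sys/vm/swappiness") as f:
    print("effective vm.swappiness:", f.read().strip())
```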

Disabling custom processing and postprocessing
To rule out the possibility of a memory leak in our custom code, we disabled all GStreamer probes and even the bbox parsing functions, so that only the bare DeepStream pipeline ran (effectively only the PGIE, since with bbox parsing disabled no objects reach the SGIE). We still observed a slow increase in swap usage, which strengthens point (1) under Suspicion below.
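
To be clear about what "disabling all GStreamer probes" means here: we simply stop attaching (or detach) our buffer probes on the inference elements, roughly like this (the element name and callback are illustrative, not our real handlers):

```python
# Illustrative only: how the custom probes are attached/detached for this experiment.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

def metadata_probe(pad, info, user_data):
    # Our real probe walks the batch metadata here; it is a no-op for the experiment.
    return Gst.PadProbeReturn.OK

def attach_probe(pgie):
    # Keep the returned id so the probe can be removed again later.
    return pgie.get_static_pad("src").add_probe(
        Gst.PadProbeType.BUFFER, metadata_probe, 0)

def detach_probe(pgie, probe_id):
    pgie.get_static_pad("src").remove_probe(probe_id)
```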

Suspicion

  1. Memory leak in DeepStream
    I am 99% confident that there are no leaks in the custom code we developed. For critical components such as buffer conversion and creation, we make sure to unmap buffers and destroy streams appropriately (a minimal sketch follows this list).
  2. Write caching
    I have read that Jetson devices write to swap memory first before flushing to disk to improve latency. We use SPD-logger extensively and flush logs frequently, so perhaps heavy disk I/O is causing the Jetson to cache in swap space?
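
To illustrate what I mean in (1) by appropriate unmapping: our buffer-handling code follows a map/use/unmap pattern along these lines (a minimal generic GStreamer sketch, not our actual conversion code, which additionally tears down CUDA streams):

```python
# Minimal sketch of the map/unmap discipline mentioned in suspicion (1).
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

def inspect_buffer(buf):
    """Map a buffer read-only, use it, and always unmap it again."""
    ok, map_info = buf.map(Gst.MapFlags.READ)
    if not ok:
        return 0
    try:
        return map_info.size  # placeholder for the real per-buffer processing
    finally:
        buf.unmap(map_info)  # unmap unconditionally so nothing is left mapped
```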

• Hardware Platform: Jetson TX2NX
• DeepStream Version: 5.1
• JetPack Version: 4.5.1
• Issue Type: bug/question

Experiments I tried:

  1. Running the same models with the reference deepstream-app: in this case the memory does not go up (monitored for 2 hours).

  2. Ran my application OUTSIDE of Docker: in this case memory consumption grows more slowly. It still goes up eventually, but much more slowly outside Docker than inside Docker (the logging sketch below is how I sample it). Is this an issue?
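
This is roughly how I log the DeepStream container's memory over time to compare the in-Docker vs. outside-Docker growth rate (the container name is a placeholder):

```python
# Rough sketch: sample the container's memory usage once a minute via `docker stats`.
import subprocess
import time

def container_mem_usage(name):
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", name],
        stdout=subprocess.PIPE, universal_newlines=True, check=True)
    return out.stdout.strip()

while True:
    print(time.strftime("%H:%M:%S"), container_mem_usage("deepstream"), flush=True)  # placeholder name
    time.sleep(60)
```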

I have also read about lighter-weight alternatives to Kafka, such as Mosquitto, that are better suited to IoT devices; would it make sense to use one of those here? Looking forward to a response!

Hi @geralt_of_rivia ,
Sorry for the delay!
Do you mean there is a memory leak in DeepStream?
Could you use the script in DeepStream SDK FAQ - #14 by mchi to capture a memory usage log over some time and find out which kind of memory is leaking?
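If the linked script is not handy, a generic per-process logger along these lines (just an illustration, not the FAQ script itself) would also show whether it is the DeepStream process whose RSS and swap keep growing:

```python
# Generic per-process memory logger; replace PID with the DeepStream process id.
import time

PID = 12345  # placeholder

def read_status(pid):
    values = {}
    with open("/proc/{}/status".format(pid)) as f:
        for line in f:
            if line.startswith(("VmRSS", "VmSwap")):
                key, value = line.split(":", 1)
                values[key] = value.strip()
    return values

while True:
    print(time.strftime("%Y-%m-%d %H:%M:%S"), read_status(PID), flush=True)
    time.sleep(60)
```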

What’s your pipeline?
Also, in the Valgrind log, did you see any other suspicious leak entries?
I think you can run the application under Valgrind for one hour and for three hours respectively, then compare the two logs to spot the suspicious leak.

Thanks!