Video Decode Memory Leaks and Escalations

Hi,

We are using the Video Codec SDK to play and loop video, and we have noticed that while looping, memory usage escalates and appears to be uncapped. Additionally, when closing the CUDA context associated with the decode, there is a very, very minor memory leak.

Files

The files mentioned below (AppDec.cpp, watchmem and windows.csv) are found in the attached zip file:

appdec.zip (7.3 KB)

AppDec

To isolate the issues, I have made the following amendments to the AppDec.cpp sample:

  • introduced a means to loop the input by way of -loop n switch (default 1)
  • added a cuCtxDestroy call to avoid leaks
  • introduced a means to sleep after the loops are complete with -delay ms (default 0)
  • introduced a means to register multiple inputs using the existing -input switch
  • introduced a means to iterate over the inputs, loops and sleeps using -iterations n (default 1)
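To give a rough idea of how the added switches hang together, here is a sketch of the parsing side - the Options struct and ParseOptions name are my own illustration, not the actual AppDec.cpp symbols:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Illustrative holder for the added switches; the real AppDec.cpp parses
// into its own local variables, so these names are mine, not the sample's.
struct Options {
    std::vector<std::string> inputs;  // -i may be given multiple times
    int loop = 1;                     // -loop n (default 1)
    int delayMs = 0;                  // -delay ms (default 0)
    int iterations = 1;               // -iterations n (default 1)
};

Options ParseOptions(int argc, char* argv[]) {
    Options opt;
    for (int i = 1; i < argc; ++i) {
        if (!strcmp(argv[i], "-i") && i + 1 < argc)
            opt.inputs.push_back(argv[++i]);
        else if (!strcmp(argv[i], "-loop") && i + 1 < argc)
            opt.loop = atoi(argv[++i]);
        else if (!strcmp(argv[i], "-delay") && i + 1 < argc)
            opt.delayMs = atoi(argv[++i]);
        else if (!strcmp(argv[i], "-iterations") && i + 1 < argc)
            opt.iterations = atoi(argv[++i]);
    }
    return opt;
}
```

The decode path then runs inside an iterations × inputs × loop nest, sleeping for delayMs milliseconds after each set of loops completes.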

By default, its behaviour is unchanged, but the additional switches make it easier to create long-running tests which can be tracked with external profiling and other tools (such as valgrind’s memcheck and massif tools).

To build, follow the instructions in the Video_Codec_SDK directory (this is based on 12.2.72), overwriting Samples/AppDecode/AppDec/AppDec.cpp before building the samples.

watchmem

I have also attached a Python script called watchmem, which takes as arguments the name of the process you want to track (in this case, AppDec on Linux, AppDec.exe on Windows) plus a poll interval and a snapshot interval in seconds, and outputs CSV on stdout. An example of its use in Cygwin with the Windows version of Python is:

$ python watchmem AppDec.exe 0.1 5.0 | tee windows.csv

Note that watchmem will simply block until AppDec is started (it polls for a new process every 0.1 s and, once it is found, takes a snapshot every 5 seconds).
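For anyone curious what such a monitor samples on Linux, here is a hypothetical helper in the same vein - watchmem itself is Python, and the exact fields it reads may differ, but the resident set size (VmRSS, in kB) is available from /proc/&lt;pid&gt;/status:

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper showing what a watchmem-style monitor might sample on
// Linux: the resident set size (VmRSS, in kB) from /proc/<pid>/status.
long ParseVmRssKb(std::istream& status) {
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {  // line starts with "VmRSS:"
            std::istringstream fields(line.substr(6));
            long kb = -1;
            fields >> kb;  // the numeric value precedes a trailing " kB"
            return kb;
        }
    }
    return -1;  // field absent (e.g. the process has exited)
}

long ReadVmRssKb(int pid) {
    std::ifstream status("/proc/" + std::to_string(pid) + "/status");
    return ParseVmRssKb(status);
}
```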

To start AppDec, run this in another terminal (replacing video.mp4 with a local file, of course):

$ Release/AppDec -i video.mp4 -gpu 0 -o nul -loop 250 -delay 10000 -iterations 10 > /dev/null

and watchmem will produce a windows.csv file with the following type of content:

snapshot time memory average min max cpu threads
#header: Release\AppDec.exe -i video.mp4 -gpu 0 -o nul -loop 250 -delay 10000 -iterations 10
#footer: started at Wed Oct  9 19:31:53 2024
1 0.30111 0.30842 0.30842 0.30842 0.30842 0.00000 6
2 5.30339 0.31139 0.30990 0.30842 0.31139 89.70000 6
3 10.30786 0.31311 0.31097 0.30842 0.31311 92.20000 6
4 15.31287 0.31911 0.31301 0.30842 0.31911 90.60000 6
5 20.31754 0.32123 0.31465 0.30842 0.32123 93.10000 6
6 25.32295 0.32334 0.31610 0.30842 0.32334 92.20000 6
7 30.32749 0.33063 0.31818 0.30842 0.33063 92.80000 3
etc

Using the Output

After or during the run, you can render graphs from the generated windows.csv using gnuplot (or similar tools such as numpy) to get various views of the memory consumption while the process is running - for example:

gnuplot -p -e 'set key autotitle columnhead; plot "windows.csv" using 2:6 with lines, "" using 2:3 with lines'

[image: gnuplot plot of the memory and max columns against time over the full run]

In this graph, we can see 10 periods of memory escalation (matching the 10 iterations), followed by a sharp drop where most of the memory is freed up (matching the 10,000 ms delay). If you look closely at the max line, however, you can see that it is slowly increasing, which indicates a very small leak (I checked our app under Linux with valgrind’s memcheck, and that indicated that there are leaks in libcuda - there is certainly a fixed leak too, but I’m fine with that). My main concern is the unchecked memory escalation during the repeated playout.

For this sample, the max memory use is more easily detectable in isolation:

[image: the max memory column plotted in isolation]

Summary

I thought the issue worth reporting, and others may find the modified AppDec sample useful for stress-testing the decode functionality.

Note that the behaviour is more or less the same on Linux (and I, for one, find it easier to test things there).

Hope that is of use to somebody - feel free to reply if more information is required.

Cheers,

Charlie

“Additionally, when closing the cuda context associated to the decode, there is a very, very minor memory leak.”

How minor? Are you sure it is enough to just destroy the CUDA context?

Good question - I’ve attached a valgrind memcheck output to this reply - feel free to analyse it.

My analysis of it is as follows:

  • The majority of the “still reachable” allocations stem from AppDec.cpp:294 - the repeated call to ck(cuInit(0)). The repetition itself seems innocuous (feel free to guard it with std::call_once or a simple static variable if you want to confirm this), and the leaks associated with this call appear to be fixed in size (and are of no concern to me).
  • Another minor fixed leak stems from AppDec.cpp:29 - the allocation of a logging context which is not freed on exit; again fixed, and again of no concern.
  • AppDec.cpp:305 and the related AppDec.cpp:72 are the problematic ones - yes, they call into NvDecoder.cpp, and to be honest I didn’t chase the specifics in there, as the leak matched what I saw when running the memcheck test against my own code. The log seems to suggest 48 bytes in this single iteration, which is pretty minor compared to the memory escalation associated with looping the file - though it was larger in my own application, up to about 512 bytes per context if memory serves.
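The std::call_once guard suggested in the first bullet can be sketched generically - the counter below stands in for ck(cuInit(0)) so the once-only behaviour can be seen without the SDK:

```cpp
#include <cassert>
#include <mutex>

// Generic sketch of a std::call_once guard: however many times InitOnce()
// is called, the wrapped initializer runs exactly once. In AppDec.cpp the
// lambda body would be ck(cuInit(0)) rather than a counter.
static int g_initCalls = 0;

void InitOnce() {
    static std::once_flag flag;
    std::call_once(flag, [] { ++g_initCalls; });
}
```

If the "still reachable" total stays flat with this guard in place, that confirms those allocations are per-process rather than per-call.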

I might be misinterpreting the memcheck output, or it might be misreporting - my hope was that someone with access to debug symbols might be able to dig a little further than I was able to.

memcheck.txt (336.7 KB)

Cheers,

Charlie