Video Decode Memory Leaks and Escalations

charles.yates · October 9, 2024, 7:14pm

Hi,

We are using the Video Codec SDK to play and loop video, and we have noticed that memory usage escalates while looping and appears to be uncapped. Additionally, when closing the cuda context associated to the decode, there is a very, very minor memory leak.

Files

The files mentioned below (AppDec.cpp, watchmem and windows.csv) are found in the attached zip file:

appdec.zip (7.3 KB)

AppDec

To isolate the issues, I have made the following amendments to the AppDec.cpp sample:

introduced a means to loop the input by way of a -loop n switch (default 1)
added a cuCtxDestroy call to avoid leaks
introduced a means to sleep after the loops are complete with -delay ms (default 0)
introduced a means to register multiple inputs using the existing -input switch
introduced a means to iterate over the inputs, loops and sleeps using -iterations n (default 1)

By default, its behaviour is unchanged, but the additional switches make it easier to create long running tests which can be tracked with external profiling and other tools (such as valgrind’s memcheck and massif tools).

To build, follow the instructions in the Video_Codec_SDK directory (this is based on 12.2.72), overwriting Samples/AppDecode/AppDec/AppDec.cpp before building the samples.

watchmem

I have also attached a python script called watchmem which takes as arguments the name of the process you want to track (in this case, AppDec on linux, AppDec.exe on windows), followed by the polling and snapshot intervals in seconds, and writes a csv file to stdout. An example of its use in cygwin with the windows version of python is:

$ python watchmem AppDec.exe 0.1 5.0 | tee windows.csv

Note that this will simply block until AppDec is started (it will poll for a new process every 0.1s and when found, will create a snapshot every 5 seconds).
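
The two intervals can be pictured with a minimal sketch like the one below. This is not the attached watchmem script, just an illustration of the wait-then-snapshot pattern described above. The find_pid and read_memory callables are hypothetical stand-ins for whatever process lookup and memory query you use, and both are assumed to return None when the process is not there.

```python
import time


def watch(find_pid, read_memory, poll_s=0.1, snapshot_s=5.0,
          sleep=time.sleep, clock=time.monotonic):
    """Yield (snapshot, elapsed, memory) tuples for one process run.

    find_pid() returns a pid or None; read_memory(pid) returns a number,
    or None once the process has gone. Both are hypothetical helpers.
    """
    # Block until the process appears, polling every poll_s seconds.
    pid = find_pid()
    while pid is None:
        sleep(poll_s)
        pid = find_pid()

    # Then take one snapshot every snapshot_s seconds until it exits.
    start = clock()
    snapshot = 0
    while True:
        memory = read_memory(pid)
        if memory is None:
            return
        snapshot += 1
        yield snapshot, clock() - start, memory
        sleep(snapshot_s)


if __name__ == "__main__":
    # Dry run with fake data so nothing real is polled or slept on.
    fake_pids = iter([None, None, 1234])
    fake_memory = iter([0.30842, 0.31139, None])
    for row in watch(lambda: next(fake_pids), lambda pid: next(fake_memory),
                     sleep=lambda seconds: None):
        print(row)
```

Passing sleep and clock in as parameters is only there to make the loop easy to dry-run; a real script would leave the defaults alone.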

To start AppDec, run this in another terminal (replacing video.mp4 with a local file of course):

$ Release/AppDec -i video.mp4 -gpu 0 -o nul -loop 250 -delay 10000 -iterations 10 > /dev/null

and watchmem will produce a windows.csv file with the following type of content:

snapshot time memory average min max cpu threads
#header: Release\AppDec.exe -i video.mp4 -gpu 0 -o nul -loop 250 -delay 10000 -iterations 10
#footer: started at Wed Oct  9 19:31:53 2024
1 0.30111 0.30842 0.30842 0.30842 0.30842 0.00000 6
2 5.30339 0.31139 0.30990 0.30842 0.31139 89.70000 6
3 10.30786 0.31311 0.31097 0.30842 0.31311 92.20000 6
4 15.31287 0.31911 0.31301 0.30842 0.31911 90.60000 6
5 20.31754 0.32123 0.31465 0.30842 0.32123 93.10000 6
6 25.32295 0.32334 0.31610 0.30842 0.32334 92.20000 6
7 30.32749 0.33063 0.31818 0.30842 0.33063 92.80000 3
etc
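
Going by the rows above, the average, min and max columns look like running statistics over the memory column (for example, the second average is the mean of the first two memory values). A minimal sketch of that calculation, written from the sample output rather than taken from the watchmem script, would be:

```python
def running_stats(samples):
    """Return (memory, running average, running min, running max) per sample."""
    rows = []
    total = 0.0
    lowest = highest = None
    for count, value in enumerate(samples, start=1):
        total += value
        lowest = value if lowest is None else min(lowest, value)
        highest = value if highest is None else max(highest, value)
        rows.append((value, total / count, lowest, highest))
    return rows


if __name__ == "__main__":
    # Memory column of the first four snapshots shown above.
    memory = [0.30842, 0.31139, 0.31311, 0.31911]
    for value, average, lowest, highest in running_stats(memory):
        print(f"{value:.5f} {average:.5f} {lowest:.5f} {highest:.5f}")
```

This matters when reading the graphs: the max column can only ever go up, so a step in it marks a new high-water mark rather than the current usage.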

Using the Output

After or during the run, you can render graphs from the generated windows.csv file using gnuplot (or similar tools such as numpy) to get various views of the memory consumption while the process is running - for example:

gnuplot -p -e 'set key autotitle columnhead; plot "windows.csv" using 2:6 with lines, "" using 2:3 with lines'

In this graph, we can see 10 periods of memory escalation (matching the 10 iterations), each followed by a sharp drop where most of the memory is freed up (matching the 10,000 ms delay). If you look closely at the max line, you can see that it is slowly increasing, which indicates a very small leak. (I checked our app under linux with valgrind’s memcheck, which indicated that there are leaks in libcuda. There is certainly a fixed-size leak too, but I’m fine with that.) My main concern is the unchecked memory escalation during the repeated playout.
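
If you would rather put a number on the slow increase than eyeball it, one rough approach is to pick out the peak and trough of each iteration from the memory column and see how far they move per cycle. The sketch below is only an illustration of that idea; the series in it is made up for the example and is not taken from windows.csv.

```python
def turning_points(samples):
    """Split a memory series into its local peaks and local troughs."""
    peaks, troughs = [], []
    for previous, current, following in zip(samples, samples[1:], samples[2:]):
        if current > previous and current >= following:
            peaks.append(current)      # top of an escalation
        elif current < previous and current <= following:
            troughs.append(current)    # bottom after a drop
    return peaks, troughs


def drift_per_cycle(points):
    """Average change between consecutive peaks (or troughs)."""
    if len(points) < 2:
        return 0.0
    return (points[-1] - points[0]) / (len(points) - 1)


if __name__ == "__main__":
    # Synthetic sawtooth: each cycle climbs, drops, and ends slightly higher.
    memory = [0.30, 0.35, 0.40, 0.31, 0.36, 0.41, 0.32, 0.37, 0.42, 0.33]
    peaks, troughs = turning_points(memory)
    print("peaks", peaks, "drift", drift_per_cycle(peaks))
    print("troughs", troughs, "drift", drift_per_cycle(troughs))
```

A trough drift that stays positive over many iterations would point at memory that never comes back after the drop, while the peak height within a cycle reflects the escalation during looping.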

For this sample, the max memory use is more easily detectable in isolation:

Summary

I thought the issue worth reporting and others may find the modified AppDec sample useful for stress testing the decode functionality.

Note that the behaviour is more or less the same on linux (and I for one find it easier to test things there).

Hope that is of use to somebody - feel free to reply if more information is required.

Cheers,

Charlie

val.zapod.vz · October 11, 2024, 4:01pm

“Additionally, when closing the cuda context associated to the decode, there is a very, very minor memory leak.”

How minor? Are you sure it is enough to just destroy the CUDA context?

charles.yates · October 11, 2024, 5:47pm

Good question - I’ve attached a valgrind memcheck output to this reply - feel free to analyse it.

My analysis of it is as follows:

The majority of the “still reachable” memory allocations stem from AppDec.cpp:294, which is the repeated call to ck(cuInit(0)). The repetition of the call seems innocuous (feel free to guard it with std::call_once or a simple static var if you want to confirm this), and the resultant leaks associated with this call appear to be fixed in size (and are of no concern to me).
Another minor fixed-size leak stems from AppDec.cpp:29, the allocation of a logging context which is not freed up on exit. Again, it is of no concern.
AppDec.cpp:305 and the related AppDec.cpp:72 are the problematic ones. Yes, this calls into NvDecoder.cpp and, to be honest, I didn’t chase the specifics in there, as the leak matched what I saw when running the memcheck test in my own code. The log seems to suggest 48 bytes in this single iteration, which seems pretty minor to me in comparison to the memory escalation associated with looping the file, though it was larger in my own application (up to about 512 bytes per context, if memory serves).

I might be misinterpreting the memcheck output, or it might be misreporting - my hope was that someone with access to debug symbols might be able to dig a little further than I was able to.

memcheck.txt (336.7 KB)

Cheers,

Charlie
