Potential memory leak - compute-sanitizer shows nothing

Hi Guys,

I develop an application that does image manipulation using CUDA. I noticed that when I instantiate my application several times (e.g. in googletest), the CUDA memory of my NVIDIA Jetson Orin Nano runs out after several minutes and several instantiations. I naturally suspect a memory leak; however, when I run my application with compute-sanitizer, it does not show any leaks.

The output after a single instantiation, with memory usage shown, is as follows:

========= COMPUTE-SANITIZER
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test 
[ RUN      ] testDeinitMemoryFree
2024-08-28_09-49-29_162: [TESTLOG]: INFO: Cuda memory usage before initialization: 33.578042%
Initializing CUDA
Initializing CUDA
############### test code runs here - init and deinit ################
2024-08-28_09-49-42_936: [TESTLOG]: INFO: Cuda memory usage after initialization: 34.453110%
[       OK ] testDeinitMemoryFree (13952 ms)
[----------] 1 test (13952 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (13952 ms total)
[  PASSED  ] 1 test.
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 0 errors

The command line I used is:

sudo /usr/local/cuda-11.4/bin/compute-sanitizer --tool memcheck --leak-check full --show-backtrace yes <testprogram>

If you look closely, you see that memory usage increased by ~1% after the test was run … Anyway, compute-sanitizer does not recognize any leaked memory. Am I using the tool wrong? Is there anything I could try to see where the increased memory usage is coming from? The pattern is consistent: if I do init/deinit several more times, memory usage climbs to a level where I can't initialize CUDA anymore because there is not enough memory left.

Additional info:

  • It's a camera application running on an NVIDIA Jetson
  • It uses libArgus for image acquisition and camera configuration
  • I use CUDAHelper.h and ArgusSamples.h to initialize CUDA for image acquisition.

I appreciate your help.

You might very well have a memory leak. There's not enough information here to identify where it is. The compute-sanitizer tool does not track every possible form of memory leak. For example, Jetson has physically unified memory - host and device memory refer to the same physical backing. Therefore, without further info about what you are doing, what you are testing, or what your printouts mean, it's possible that the leak is through the use of a host API - which compute-sanitizer doesn't track.

Usually, a leak can be traced to a specific sequence of API calls. Therefore, divide-and-conquer is typically a good strategy to narrow down the source of a leak; see the sketch below.
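As a sketch of that strategy (the phase functions are made-up placeholders for your own code), you can bracket each phase of a test with cudaMemGetInfo and see in which phase the memory disappears:

#include <cstdio>
#include <cuda_runtime.h>

// Log free/total device memory with a label; calling this between the
// phases of a test shows which phase is losing memory.
static void logDeviceMem(const char* label)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("[%s] free: %zu MiB of %zu MiB\n",
           label, freeBytes >> 20, totalBytes >> 20);
}

// Usage sketch inside a test:
//   logDeviceMem("before init");
//   initCamera();                 // hypothetical phase 1
//   logDeviceMem("after init");
//   processImages();              // hypothetical phase 2
//   logDeviceMem("after processing");
//   deinitCamera();
//   logDeviceMem("after deinit");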

To track down memory errors in host code, I would recommend using valgrind.
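A typical invocation would look something like this (standard valgrind flags; <testprogram> is a placeholder for your test binary):

valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes <testprogram>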

Hi!

Thanks for the answers. I know my description was not very detailed - sorry about that! The program is basically an API for the cameras, which we test using googletest. The API does the buffer handling of the received images and also uses CUDA for some image optimizations. I will dig deeper into the code and check whether I can divide it further and pinpoint the location of the suspected leak.

I also tried using valgrind, but it does not work and reports an unhandled instruction:

ARM64 front end: load_store
disInstr(arm64): unhandled instruction 0xB8A18002
disInstr(arm64): 1011'1000 1010'0001 1000'0000 0000'0010
==3736== valgrind: Unrecognised instruction at address 0x4c6b958.
==3736==    at 0x4C6B958: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==3736==    by 0x4BE9A7B: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==3736==    by 0x4DC556B: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==3736==    by 0x4C19013: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==3736==    by 0x2B3F93: __cudart915 (in /var/tmp/testprog)
==3736==    by 0x60C53B7: __pthread_once_slow (pthread_once.c:116)
==3736==    by 0x2FE8BB: __cudart1186 (in /var/tmp/testprog)
==3736==    by 0x2AA62F: __cudart102 (in /var/tmp/testprog)
==3736==    by 0x2D6AFB: cudaMallocManaged (in /var/tmp/testprog)

It seems that valgrind cannot decode an instruction inside libcuda that is reached via cudaMallocManaged.

Have you tried non-managed memory? (It is usually faster anyway, though I'm not sure about Jetson.)
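For illustration, a minimal sketch of explicit device memory with cudaMemcpy in place of cudaMallocManaged (the function and buffer are made up). This would also sidestep the valgrind problem above, since the unrecognized instruction is reached through cudaMallocManaged:

#include <cuda_runtime.h>
#include <vector>

void processImage(std::vector<unsigned char>& img)
{
    const size_t bytes = img.size();
    unsigned char* dev = nullptr;
    cudaMalloc(&dev, bytes);                      // explicit device allocation
    cudaMemcpy(dev, img.data(), bytes, cudaMemcpyHostToDevice);
    // ... launch image-processing kernels on dev ...
    cudaMemcpy(img.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);                                // every cudaMalloc needs a cudaFree
}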

Okay, here are the further insights I got.

As said, we use googletest, and while running one test case with init/deinit (which supposedly has a memory leak), the CUDA memory climbs to ~3.5-4 GB and then stays there consistently, even over 100 iterations. I also checked every cudaMalloc we do, and every allocation has a matching cudaFree (cuda - Why doesn't CudaFree seem to free memory? - Stack Overflow). However, when the test case finishes and a second test case is started, I get the following error:

Initializing CUDA
NVMAP_IOC_GET_FD failed: Bad address
Error generated. /usr/src/jetson_multimedia_api/argus/samples/utils/CUDAHelper.cpp, initCUDA:81 Unable to initialize the CUDA driver API (CUresult unknown error)

The "Initializing CUDA" output comes from the function CUDAHelper::initCuda(), which I use to initialize the CUDA context.

When the second test case is started, the memory usage does not drop and stays around ~3.5 GB. I also observed that it does not climb any higher; the free memory always stays around 1 GB. I checked that using jetson-stats.
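Since the first test is called testDeinitMemoryFree, one alternative to eyeballing jetson-stats is asserting the balance directly in the test. A sketch, assuming googletest, with hypothetical init/deinit calls and a guessed tolerance for driver-internal caching:

#include <gtest/gtest.h>
#include <cuda_runtime.h>

TEST(CameraApi, DeinitReturnsMemory)
{
    size_t freeBefore = 0, freeAfter = 0, total = 0;
    cudaMemGetInfo(&freeBefore, &total);
    // initCamera();   // hypothetical init under test
    // deinitCamera(); // hypothetical deinit under test
    cudaMemGetInfo(&freeAfter, &total);
    // Allow ~16 MiB of slack for driver-internal caching.
    EXPECT_NEAR(static_cast<double>(freeBefore),
                static_cast<double>(freeAfter),
                16.0 * 1024 * 1024);
}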

Unfortunately, neither valgrind (which does not work, see comment above) nor compute-sanitizer shows any helpful output.

Ideally, you would be using RAII containers like thrust::device_vector, which automatically allocates and frees the memory, just like std::vector.
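For example, a minimal sketch (the element type and size are arbitrary):

#include <thrust/device_vector.h>

void work()
{
    // Device memory is allocated on construction and freed automatically
    // when the vector goes out of scope - no cudaFree to forget.
    thrust::device_vector<float> pixels(1920 * 1080);
    // ... pass thrust::raw_pointer_cast(pixels.data()) to kernels ...
}   // destructor releases the device memory here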

Hi,

Coming back to this, as I did more troubleshooting.

It seems that the problem is something else and not a memory leak. As said in my last comment, we use googletest to test our camera streaming API. The tests run fine, but after a certain number of tests in which CUDA was initialized using the CUDAHelper function, the cuCtxCreate_v2 function fails with an unknown error, as seen in my last post.
There is still plenty of memory available (>3 GB), but CUDA does not seem to be able to create a new handle. It is able to run 25 tests, and regardless of what comes after as the 26th test, it fails.

Is there any limit on how many handles an application can create? Why can I run a loop with 100 init/deinit cycles without a problem, but it crashes when another test case is started?

Is there any way to reset the handle count, or to check or increase any CUDA-related resources?

Lots of questions - hopefully someone can help me further!

Thank you

If by “handle” you mean “context”, there is a limited number of device contexts that can be simultaneously resident on a GPU. Perhaps there is some resource that is not being properly destroyed. Perhaps your usage of the Argus API does not involve a proper shutdown.
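To illustrate the pairing that has to hold, here is a driver-API sketch (error checking omitted for brevity): every cuCtxCreate must be matched by a cuCtxDestroy, otherwise contexts accumulate across tests until creation fails:

#include <cuda.h>

void runOneTest()
{
    CUdevice dev;
    CUcontext ctx;
    cuInit(0);                   // safe to call more than once
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);   // occupies a context slot on the device
    // ... test body ...
    cuCtxDestroy(ctx);           // without this, the slot stays occupied
}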

As a diagnostic, since you seem to be using the CUDA runtime API (and I normally wouldn't recommend this), you might try inserting a cudaDeviceReset() at the end of each test.
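A sketch of where such a reset could go, assuming a googletest fixture (the fixture name is made up):

#include <gtest/gtest.h>
#include <cuda_runtime.h>

class CameraTest : public ::testing::Test
{
protected:
    void TearDown() override
    {
        // Destroys all allocations and tears down the primary context
        // for this process on the current device - a blunt diagnostic.
        cudaDeviceReset();
    }
};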

Hi Robert,

Thanks for the answer. The problem was indeed that we had multiple contexts open and didn't close all of them. We switched to only two contexts for our tests and destroy them correctly during deinitialization, and this fixed the memory usage issue!
