Is there a memory leak in CUDA

mmetz_nv · June 5, 2008, 1:45pm

Hi,

I think there may be a memory leak in CUDA under Linux, and my guess is that it is related to device context creation/deletion. Let me explain why. I have an application that runs two or more threads on the host. In each thread, some code is executed on the GPU (cuFFT plus one kernel); then the thread finishes, and new threads are started in a loop. Everything works fine unless I let the program run for a really long time (which I need to do). Then it crashes reproducibly, always after the same number of loop iterations (about 3 1/2 hours).

I wrote a small test program that basically executes only the GPU part and checked it with valgrind. This is what I get:

==15340==
==15340== 512 bytes in 1 blocks are definitely lost in loss record 8 of 10
==15340== at 0x4C20FAB: malloc (vg_replace_malloc.c:207)
==15340== by 0x53DB156: (within /usr/lib/libcuda.so.169.12)
==15340== by 0x53E6622: (within /usr/lib/libcuda.so.169.12)
==15340== by 0x53CC44E: (within /usr/lib/libcuda.so.169.12)
==15340== by 0x53C6665: cuCtxCreate (in /usr/lib/libcuda.so.169.12)
==15340== by 0x4E4D993: (within /usr/local/cuda/lib/libcudart.so.1.1)
==15340== by 0x4E4E006: (within /usr/local/cuda/lib/libcudart.so.1.1)
==15340== by 0x4E372AE: cudaMalloc (in /usr/local/cuda/lib/libcudart.so.1.1)
==15340== by 0x4024E0: init_greens(float const*, unsigned) (in /export/aibn44_3/mmetz/nvidia_cuda_sdk/projects/suboGPU/stress)
==15340== by 0x4016B4: main (in /export/aibn44_3/mmetz/nvidia_cuda_sdk/projects/suboGPU/stress)
==15340==
==15340==
==15340== 512 bytes in 1 blocks are definitely lost in loss record 9 of 10
==15340== at 0x4C20FAB: malloc (vg_replace_malloc.c:207)
==15340== by 0x53DAD96: (within /usr/lib/libcuda.so.169.12)
==15340== by 0x53E66E6: (within /usr/lib/libcuda.so.169.12)
==15340== by 0x53CC44E: (within /usr/lib/libcuda.so.169.12)
==15340== by 0x53C6665: cuCtxCreate (in /usr/lib/libcuda.so.169.12)
==15340== by 0x4E4D993: (within /usr/local/cuda/lib/libcudart.so.1.1)
==15340== by 0x4E4E006: (within /usr/local/cuda/lib/libcudart.so.1.1)
==15340== by 0x4E372AE: cudaMalloc (in /usr/local/cuda/lib/libcudart.so.1.1)
==15340== by 0x4024E0: init_greens(float const*, unsigned) (in /export/aibn44_3/mmetz/nvidia_cuda_sdk/projects/suboGPU/stress)
==15340== by 0x4016B4: main (in /export/aibn44_3/mmetz/nvidia_cuda_sdk/projects/suboGPU/stress)
==15340==
==15340== LEAK SUMMARY:
==15340== definitely lost: 1,216 bytes in 4 blocks.
==15340== possibly lost: 0 bytes in 0 blocks.
==15340== still reachable: 1,218 bytes in 16 blocks.
==15340== suppressed: 0 bytes in 0 blocks.

As you can see, memory is allocated when a device context is created (the cuCtxCreate call) and is never freed. So my guess is that, because I create lots and lots of threads, some blocks of memory are allocated each time and are not correctly freed.

@NVIDIA: Any ideas?

I use CUDA 1.1 with driver version 169.12 on a 64-bit Linux box.

Manuel

netllama · June 5, 2008, 1:58pm

If this problem persists with the CUDA_2.0-beta, please provide a test app which reproduces the problem, along with an nvidia-bug-report.log.

mmetz_nv · June 6, 2008, 7:32am

Hm, I have a Debian machine and used the Ubuntu CUDA 1.1 package, but there are no Ubuntu packages of the 2.0 beta available. Which one should I use?

Moreover, I did some more tests. In a new version of the full program I create a device context with

cuCtxCreate(&_ctx, 0, _dev);

and free it at the end with

cuCtxDetach(_ctx);

Still, the program fails at the same loop iteration. Here are the important lines of the valgrind output after ONE loop iteration:

==9664==
==9664== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 1)
==9664== malloc/free: in use at exit: 80,215 bytes in 903 blocks.
==9664== malloc/free: 1,317,605 allocs, 1,316,702 frees, 605,698,244 bytes allocated.
==9664== For counts of detected errors, rerun with: -v
==9664== searching for pointers to 903 not-freed blocks.
==9664== checked 600,512 bytes.
==9664==
==9664== 256 bytes in 4 blocks are definitely lost in loss record 5 of 11
==9664== at 0x4C20FAB: malloc (vg_replace_malloc.c:207)
==9664== by 0x4E5B462: (within /usr/lib/libcuda.so.169.12)
==9664== by 0x4E4144E: (within /usr/lib/libcuda.so.169.12)
==9664== by 0x4E3B665: cuCtxCreate (in /usr/lib/libcuda.so.169.12)
==9664== by 0x4041B2: _ZL13running_step1Pv (in sbpp)
==9664== by 0x5BE5FC6: start_thread (in /lib/libpthread-2.7.so)
==9664== by 0x667078C: clone (in /lib/libc-2.7.so)
==9664==
==9664==
==9664== 512 bytes in 4 blocks are definitely lost in loss record 7 of 11
==9664== at 0x4C20FAB: malloc (vg_replace_malloc.c:207)
==9664== by 0x4E5B46D: (within /usr/lib/libcuda.so.169.12)
==9664== by 0x4E4144E: (within /usr/lib/libcuda.so.169.12)
==9664== by 0x4E3B665: cuCtxCreate (in /usr/lib/libcuda.so.169.12)
==9664== by 0x4041B2: _ZL13running_step1Pv (in sbpp)
==9664== by 0x5BE5FC6: start_thread (in /lib/libpthread-2.7.so)
==9664== by 0x667078C: clone (in /lib/libc-2.7.so)
==9664==
==9664==
==9664== 2,048 bytes in 4 blocks are definitely lost in loss record 9 of 11
==9664== at 0x4C20FAB: malloc (vg_replace_malloc.c:207)
==9664== by 0x4E4FD96: (within /usr/lib/libcuda.so.169.12)
==9664== by 0x4E5B6E6: (within /usr/lib/libcuda.so.169.12)
==9664== by 0x4E4144E: (within /usr/lib/libcuda.so.169.12)
==9664== by 0x4E3B665: cuCtxCreate (in /usr/lib/libcuda.so.169.12)
==9664== by 0x4041B2: _ZL13running_step1Pv (in sbpp)
==9664== by 0x5BE5FC6: start_thread (in /lib/libpthread-2.7.so)
==9664== by 0x667078C: clone (in /lib/libc-2.7.so)
==9664==
==9664==
==9664== 2,048 bytes in 4 blocks are definitely lost in loss record 10 of 11
==9664== at 0x4C20FAB: malloc (vg_replace_malloc.c:207)
==9664== by 0x4E50156: (within /usr/lib/libcuda.so.169.12)
==9664== by 0x4E5B622: (within /usr/lib/libcuda.so.169.12)
==9664== by 0x4E4144E: (within /usr/lib/libcuda.so.169.12)
==9664== by 0x4E3B665: cuCtxCreate (in /usr/lib/libcuda.so.169.12)
==9664== by 0x4041B2: _ZL13running_step1Pv (in sbpp)
==9664== by 0x5BE5FC6: start_thread (in /lib/libpthread-2.7.so)
==9664== by 0x667078C: clone (in /lib/libc-2.7.so)
==9664==
==9664== LEAK SUMMARY:
==9664== definitely lost: 4,864 bytes in 16 blocks.
==9664== possibly lost: 0 bytes in 0 blocks.
==9664== still reachable: 75,351 bytes in 887 blocks.
==9664== suppressed: 0 bytes in 0 blocks.

This makes me think that not every memory block allocated by cuCtxCreate is freed by cuCtxDetach.
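A rough way to read these numbers, as a sketch only: the single-context test lost 1,216 bytes in 4 blocks, and the run above lost 4,864 bytes in 16 blocks, with every loss record reporting 4 blocks. That is consistent with four cuCtxCreate calls leaking the same amount each, but the log does not state the number of contexts, so the count of four is an assumption. The helper names below (`leak_per_context`, `projected_leak`) are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Average "definitely lost" bytes per context, assuming every context
// leaks the same amount. `contexts` is an assumed count, not a logged one.
std::uint64_t leak_per_context(std::uint64_t lost_bytes, std::uint64_t contexts) {
    assert(contexts > 0 && "at least one context is required");
    return lost_bytes / contexts;
}

// Host memory lost after `contexts` create/detach cycles, under the same
// constant-leak-per-context assumption.
std::uint64_t projected_leak(std::uint64_t bytes_per_context, std::uint64_t contexts) {
    return bytes_per_context * contexts;
}
```

With the figures above, `leak_per_context(4864, 4)` gives 1216, the same as the single-context run, and `projected_leak(1216, 10000)` gives 12,160,000 bytes, i.e. roughly 12 MB of host memory after ten thousand contexts, if the leak really is constant per context.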

As you can see, memory allocated by /usr/lib/libcuda.so.169.12 is lost and not freed at the end. This library, libcuda.so.169.12, is provided by the driver and not by the CUDA package, isn’t it? So my next step will be to update the driver. I’ll keep you up to date.

Update: No change when switching to driver 173.14.05; same valgrind output, same amount of memory lost.

mmetz_nv · June 6, 2008, 9:19am

So, now I can supply you with a very simple example application that shows the memory leak in CUDA 1.1. The source is attached.

The code simply allocates memory on the device, copies some data to it, and frees the memory again. This implicitly creates a device context. The code also deliberately produces a memory leak by allocating 128 bytes and NOT freeing them, to demonstrate what a leak looks like in the valgrind output. To see the difference, uncomment the last free().

Here are the main lines of the resulting valgrind output concerning the memleak:

==14759==
==14759== malloc/free: in use at exit: 5,459 bytes in 31 blocks.
==14759== malloc/free: 26,444 allocs, 26,413 frees, 69,658,360 bytes allocated.
==14759==
==14759== searching for pointers to 31 not-freed blocks.
==14759== checked 4,717,744 bytes.
==14759==
==14759==
==14759== 64 bytes in 1 blocks are definitely lost in loss record 7 of 17
==14759== at 0x4C20FAB: malloc (vg_replace_malloc.c:207)
==14759== by 0x8779D42: (within /usr/lib/libcuda.so.173.14.05)
==14759== by 0x875FB6E: (within /usr/lib/libcuda.so.173.14.05)
==14759== by 0x8759D85: cuCtxCreate (in /usr/lib/libcuda.so.173.14.05)
==14759== by 0x4E4D993: (within /usr/local/cuda/lib/libcudart.so.1.1)
==14759== by 0x4E4E006: (within /usr/local/cuda/lib/libcudart.so.1.1)
==14759== by 0x4E372AE: cudaMalloc (in /usr/local/cuda/lib/libcudart.so.1.1)
==14759== by 0x4016CA: main (in /tmp/memleaktest)
==14759==
==14759==
==14759== 128 bytes in 1 blocks are definitely lost in loss record 10 of 17
==14759== at 0x4C20FAB: malloc (vg_replace_malloc.c:207)
==14759== by 0x8779D4D: (within /usr/lib/libcuda.so.173.14.05)
==14759== by 0x875FB6E: (within /usr/lib/libcuda.so.173.14.05)
==14759== by 0x8759D85: cuCtxCreate (in /usr/lib/libcuda.so.173.14.05)
==14759== by 0x4E4D993: (within /usr/local/cuda/lib/libcudart.so.1.1)
==14759== by 0x4E4E006: (within /usr/local/cuda/lib/libcudart.so.1.1)
==14759== by 0x4E372AE: cudaMalloc (in /usr/local/cuda/lib/libcudart.so.1.1)
==14759== by 0x4016CA: main (in /tmp/memleaktest)
==14759==
==14759==
==14759== 128 bytes in 1 blocks are definitely lost in loss record 11 of 17
==14759== at 0x4C20FAB: malloc (vg_replace_malloc.c:207)
==14759== by 0x40166B: main (in /tmp/memleaktest)
==14759==
==14759==
==14759== 512 bytes in 1 blocks are definitely lost in loss record 14 of 17
==14759== at 0x4C20FAB: malloc (vg_replace_malloc.c:207)
==14759== by 0x876E5E6: (within /usr/lib/libcuda.so.173.14.05)
==14759== by 0x8779FC6: (within /usr/lib/libcuda.so.173.14.05)
==14759== by 0x875FB6E: (within /usr/lib/libcuda.so.173.14.05)
==14759== by 0x8759D85: cuCtxCreate (in /usr/lib/libcuda.so.173.14.05)
==14759== by 0x4E4D993: (within /usr/local/cuda/lib/libcudart.so.1.1)
==14759== by 0x4E4E006: (within /usr/local/cuda/lib/libcudart.so.1.1)
==14759== by 0x4E372AE: cudaMalloc (in /usr/local/cuda/lib/libcudart.so.1.1)
==14759== by 0x4016CA: main (in /tmp/memleaktest)
==14759==
==14759==
==14759== 512 bytes in 1 blocks are definitely lost in loss record 15 of 17
==14759== at 0x4C20FAB: malloc (vg_replace_malloc.c:207)
==14759== by 0x876E9A6: (within /usr/lib/libcuda.so.173.14.05)
==14759== by 0x8779F02: (within /usr/lib/libcuda.so.173.14.05)
==14759== by 0x875FB6E: (within /usr/lib/libcuda.so.173.14.05)
==14759== by 0x8759D85: cuCtxCreate (in /usr/lib/libcuda.so.173.14.05)
==14759== by 0x4E4D993: (within /usr/local/cuda/lib/libcudart.so.1.1)
==14759== by 0x4E4E006: (within /usr/local/cuda/lib/libcudart.so.1.1)
==14759== by 0x4E372AE: cudaMalloc (in /usr/local/cuda/lib/libcudart.so.1.1)
==14759== by 0x4016CA: main (in /tmp/memleaktest)
==14759==
==14759== LEAK SUMMARY:
==14759== definitely lost: 1,344 bytes in 5 blocks.
==14759== possibly lost: 0 bytes in 0 blocks.
==14759== still reachable: 4,115 bytes in 26 blocks.
==14759== suppressed: 0 bytes in 0 blocks.

The third loss record is the forced leak: exactly the 128 bytes that were allocated in main but never freed are reported as lost.
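To separate the deliberate leak from the library's leaks without reading every record by hand, the "definitely lost" records can be summed according to whether their stack mentions cuCtxCreate. The following is only a sketch: `LeakSplit`, `split_leaks` and `kSampleLog` are hypothetical names, the sample is the valgrind output above with most stack frames trimmed, and the parser assumes the record layout shown in this thread (other valgrind versions may format records differently).

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// "Definitely lost" bytes, split by whether a record's stack names a symbol.
struct LeakSplit {
    std::uint64_t with_symbol = 0;
    std::uint64_t without_symbol = 0;
};

// Parse a count such as "1,344"; valgrind prints thousands separators.
static std::uint64_t parse_count(const std::string& text) {
    std::uint64_t value = 0;
    for (char c : text) {
        if (c >= '0' && c <= '9') value = value * 10 + static_cast<std::uint64_t>(c - '0');
    }
    return value;
}

LeakSplit split_leaks(const std::string& log, const std::string& symbol) {
    assert(!symbol.empty() && "an empty symbol would match every line");
    LeakSplit result;
    std::uint64_t current = 0;  // bytes of the record being read
    bool in_record = false;
    bool seen_symbol = false;

    // Add the finished record (if any) to the matching total.
    auto flush = [&]() {
        if (!in_record) return;
        if (seen_symbol) result.with_symbol += current;
        else result.without_symbol += current;
        in_record = false;
    };

    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t marker = line.find(" bytes in ");
        const bool header = marker != std::string::npos && marker > 0 &&
                            line.find("are definitely lost") != std::string::npos;
        if (header) {
            flush();
            // The count sits between the "==PID== " prefix and " bytes in".
            const std::size_t space = line.rfind(' ', marker - 1);
            const std::size_t start = (space == std::string::npos) ? 0 : space + 1;
            current = parse_count(line.substr(start, marker - start));
            in_record = true;
            seen_symbol = false;
        } else if (line.find("LEAK SUMMARY") != std::string::npos) {
            flush();  // the summary ends the last record
        } else if (in_record && line.find(symbol) != std::string::npos) {
            seen_symbol = true;
        }
    }
    flush();  // the log may end without a summary
    return result;
}

// Trimmed copy of the valgrind output above (most stack frames removed).
const char* const kSampleLog = R"LOG(
==14759== 64 bytes in 1 blocks are definitely lost in loss record 7 of 17
==14759== by 0x8759D85: cuCtxCreate (in /usr/lib/libcuda.so.173.14.05)
==14759== 128 bytes in 1 blocks are definitely lost in loss record 10 of 17
==14759== by 0x8759D85: cuCtxCreate (in /usr/lib/libcuda.so.173.14.05)
==14759== 128 bytes in 1 blocks are definitely lost in loss record 11 of 17
==14759== at 0x4C20FAB: malloc (vg_replace_malloc.c:207)
==14759== by 0x40166B: main (in /tmp/memleaktest)
==14759== 512 bytes in 1 blocks are definitely lost in loss record 14 of 17
==14759== by 0x8759D85: cuCtxCreate (in /usr/lib/libcuda.so.173.14.05)
==14759== 512 bytes in 1 blocks are definitely lost in loss record 15 of 17
==14759== by 0x8759D85: cuCtxCreate (in /usr/lib/libcuda.so.173.14.05)
==14759== LEAK SUMMARY:
==14759== definitely lost: 1,344 bytes in 5 blocks.
)LOG";
```

Calling `split_leaks(kSampleLog, "cuCtxCreate")` attributes 1,216 bytes to stacks containing cuCtxCreate and 128 bytes to everything else. The two add up to the 1,344 bytes in the summary, and the 1,216 bytes match the amount lost in the very first test, which had no deliberate leak.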

So in conclusion, I dare say that there IS indeed a memory leak in CUDA or in the underlying driver.

[Update] There is an even simpler example; see the second listing. It also produces memory leaks. [/Update]

@NVIDIA: If there is anything else I can do, please let me know. In any case, it would be really great if this memleak could be fixed soon.

// First listing: memleaktest (runtime API)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <cutil.h>
#include <cufft.h>

// run the program with
//   valgrind --leak-check=full --log-file=memleak.log memleaktest

int main()
{
    printf("This is memleaktest\n");
    const int N = 64;
    const int totalsize = N*N*N;

    cufftComplex* h_data = NULL;
    float* to_be_lost = NULL;

    // first allocate host memory and fill it with some data
    h_data = (cufftComplex*) malloc(totalsize*sizeof(cufftComplex));

    // deliberate leak: 32 floats (128 bytes) that are never freed
    printf("allocate memory %lu\n", (unsigned long)(32*sizeof(float)));
    to_be_lost = (float*) malloc(32*sizeof(float));
    srand(2006);

    for(int i=0; i<totalsize; i++) {
        h_data[i].x = rand();
        h_data[i].y = rand();
    }
    to_be_lost[0]=0.0; // avoid a compiler warning by accessing the data once

    // allocate memory on device (this implicitly creates the device context)
    cufftComplex* d_data = NULL;
    CUDA_SAFE_CALL( cudaMalloc( (void**)&d_data, sizeof(cufftComplex)*totalsize ) );

    // copy host memory to device
    CUDA_SAFE_CALL( cudaMemcpy(d_data, h_data, sizeof(cufftComplex)*totalsize, cudaMemcpyHostToDevice) );

    CUDA_SAFE_CALL( cudaFree(d_data) );
    free(h_data);

    // uncomment the next line to remove the deliberate leak and compare
    //free(to_be_lost);

    return 0;
}

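// Second listing: driver API only, explicit context create/detach
#include <iostream>
#include <cuda.h>
#include <cutil.h>

using namespace std;

int main() {
    CUdevice cuDevice;
    CUcontext cuContext;

    // cutil helper from the SDK: initializes the driver API and picks a device
    CUT_DEVICE_INIT_DRV(cuDevice);

    CUresult status = cuCtxCreate(&cuContext, 0, cuDevice);
    if ( CUDA_SUCCESS != status ) {
        cout << "error creating CUDA device context" << endl;
        return 1; // do not detach a context that was never created
    }

    cuCtxDetach(cuContext);
    return 0;
}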

netllama · June 6, 2008, 4:34pm

Thanks for the information. Testing against our latest internal development driver yields the following results:

==16991== LEAK SUMMARY:
==16991== definitely lost: 0 bytes in 0 blocks.
==16991== possibly lost: 0 bytes in 0 blocks.
==16991== still reachable: 3,289 bytes in 12 blocks.
==16991== suppressed: 0 bytes in 0 blocks

Therefore, this appears to be fixed, and should be fixed in the 2.0-final release.

mmetz_nv · June 6, 2008, 4:43pm

That sounds great! Thanks! Were you able to test it under 64-bit Linux, since that is what I’m using? And just wondering, do you know whether the bug was in the kernel driver (libcuda is part of the driver, I think) or in the SDK?

[Update] I tried to install the latest beta driver, but it failed to compile. See http://forums.nvidia.com/index.php?showtopic=65356

netllama · June 11, 2008, 4:49pm

Actually, I misspoke earlier. While most of the leaks are resolved, there is at least one remaining leak which is currently being tracked in bug 408311.

libcuda.so is the CUDA driver, however the only Linux kernel driver is nvidia.ko. The leak here is in libcuda.so.

The CUDA_2.0-final driver will build & install without any issues on a 2.6.25.x kernel.
