Is there a memory leak in CUDA


I think that there is/might be a memory leak in CUDA under linux, and my guess is that it is related to device context creation/deletion. First let be explain why I think so. I have an application that runs 2 or more threads on the host. In each thread, some code is executed on the GPU (cuFFT + one kernel), then the thread finishes, and new threads are started in a loop. Everything works fine, unless I let my program run really long (I need to do that). Then I can reproduce a crash, always after the the same number of loop iterations in my program (takes ~3 1/2 hours), and it reproducibly crashes:

Now, I wrote a small test program, that basically only executes the GPU part, and I checked it with valgrind, and that’s what I get:

As you can see, memory gets allocated when a device Context is created (the cuCtxCreate call), that is not freed. So my guess it that, because I create lots and lots of threads, some blocks of memory get allocated each time – and they are not correctly freed.

@NVIDIA: Any ideas ???

I use CUDA 1.1, driver version 169.12 on an 64bit linux box.


If this problem persists with the CUDA_2.0-beta, please provide a test app which reproduces the problem, along with an nvidia-bug-report.log.

Hm, I have a debian machine and did use the Ubuntu CUDA 1.1 package, but there are no Ubunutu packages of the 2.0beta available. Which should I use ???

Moreover, I did some more tests. In a new version of the full program I create a device context with

cuCtxCreate(&_ctx, 0, _dev);

and free it at the end with


Still, the program exits faulty at the same loopstep. Here are the important lines of the valgrind output after ONE loop:

So this makes me thinking that not every memory block that is allocated by calling cuCtxCreate is freed when calling cuCtxDetach :unsure:

As you can see, there is memory lost that is allocated by /usr/lib/ and not freed at the end. This library,, is provided by the driver and not by the CUDA package, isn’t it? So my next step will be to update the driver … Let’s see what happens… I’ll keep you up to date …

Update: :no: No change when switching to 173.14.05; same valgrind output, same amount of memory lost …

So, now I can supply you with a very simple example application that shows the memory leak in CUDA 1.1. The source is attached.

What the code does is simply allocating memory on the device, copy some data to it and free the memory again. By this, a device context is created implicitly. There is a second thing the code does, namely deliberately producing a memleak by allocating 128Bytes and NOT freeing it again. This is to demonstrate the effect of a memleak. Check the difference by removing the comments of the last free().

Here are the main lines of the resulting valgrind output concerning the memleak:

The third message is the forced memleak; exactly those 128Bytes allocated but not freed are detected to be lost.

So in conclusion, I dare to say :blink: that there IS indeed a memleak in CUDA or in the underlying driver.

[Update] There is even a more simple example, see the second listing. This also produced memory leaks … [/Update]

@NVIDIA: If there is anything else I can do, please let me know. In any case, it would be really great if this memleak could be fixed soon.

#include <stdlib.h>

#include <string.h>

#include <math.h>

#include <cutil.h>

#include <cufft.h>

// run the program with

//   valgrind --leak-check=full --log-file=memleak.log memleaktest

int main()


    printf("This is memleaktest\n");

    const int N = 64;

    const int totalsize = N*N*N;


    cufftComplex* h_data = NULL;

    float* to_be_lost = NULL;


    // first allocate memory and fill it with some data

    h_data = (cufftComplex*) malloc(totalsize*sizeof(cufftComplex));


    printf("allocate memory %d\n",32*sizeof(float));

    to_be_lost = (float*) malloc(32*sizeof(float));



    for(int i=0; i<totalsize; i++) {

        h_data[i].x = rand();

        h_data[i].y = rand();


    to_be_lost[0]=0.0; // avoid a compiler warning by once accessing the data


    // allocate memory on device

    cufftComplex* d_data = NULL;

    CUDA_SAFE_CALL( cudaMalloc( (void**)&d_data, sizeof(cufftComplex)*totalsize ) );


    // copy host memory to device

    CUDA_SAFE_CALL( cudaMemcpy(d_data, h_data, sizeof(cufftComplex)*totalsize, cudaMemcpyHostToDevice) );


    CUDA_SAFE_CALL( cudaFree(d_data) );


    // remove this comment to see the difference !!!


#include <iostream>

#include <cuda.h>

#include <cutil.h>

using namespace std;

int main() {

    CUdevice cuDevice;

    CUcontext cuContext;




    CUresult status = cuCtxCreate(&cuContext, 0, cuDevice);

    if ( CUDA_SUCCESS != status )

        cout << "error creating CUDA device context" << endl;



Thanks for the information. Testing against our latest internal development driver yields the following results:

==16991== LEAK SUMMARY:
==16991== definitely lost: 0 bytes in 0 blocks.
==16991== possibly lost: 0 bytes in 0 blocks.
==16991== still reachable: 3,289 bytes in 12 blocks.
==16991== suppressed: 0 bytes in 0 blocks

Therefore, this appears to be fixed, and should be fixed in the 2.0-final release.

That sounds great! Thanks! Have you been able to test it under 64Bit linux - because I’m using that? And just wondering, do you know whether the bug was in the kernel driver (as libcuda is part of the driver, I think) or in the SDK?

[Update] I tried to install the latest beta driver, but it failed to compile. See

Actually, I misspoke earlier. While most of the leaks are resolved, there is at least one remaining leak which is currently being tracked in bug 408311. is the CUDA driver, however the only Linux kernel driver is nvidia.ko. The leak here is in

The CUDA_2.0-final driver will build & install without any issues on a 2.6.25.x kernel.