On a large project that uses CUDA, I’m running valgrind to try to track down memory leaks. To make my life easier, I wrote a stand-alone program that reproduces the sequence of CUDA operations the large project performs:
- Allocate memory on the GPU
- Create a set of FFT plans
- Create a number of CUDA streams and assign them to the FFT plans via cufftSetStream
- Repeatedly perform FFT operations
- Destroy streams
- Destroy FFT plans
- Free FFT plan memory
- Free GPU memory
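For reference, the stand-alone program's workflow looks roughly like this. This is a minimal sketch, not the actual code: NSTREAMS, NX, BATCH, and the iteration count are placeholders I’ve picked for illustration.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

#define NX 128     /* points per FFT (placeholder) */
#define BATCH 5    /* FFTs per plan (placeholder) */
#define NSTREAMS 4 /* number of streams/plans (placeholder) */

int main(void)
{
    cufftComplex *data[NSTREAMS];
    cufftHandle plan[NSTREAMS];
    cudaStream_t stream[NSTREAMS];

    /* Allocate GPU memory, create plans and streams, bind them together. */
    for (int i = 0; i < NSTREAMS; i++) {
        cudaMalloc((void **)&data[i], sizeof(cufftComplex) * NX * BATCH);
        cufftPlan1d(&plan[i], NX, CUFFT_C2C, BATCH);
        cudaStreamCreate(&stream[i]);
        cufftSetStream(plan[i], stream[i]);
    }

    /* Repeatedly execute FFTs, each plan on its own stream. */
    for (int iter = 0; iter < 100; iter++)
        for (int i = 0; i < NSTREAMS; i++)
            cufftExecC2C(plan[i], data[i], data[i], CUFFT_FORWARD);
    cudaThreadSynchronize(); /* wait for all streams (CUDA 3.2 API) */

    /* Tear down in the same order as the list above. */
    for (int i = 0; i < NSTREAMS; i++) {
        cudaStreamDestroy(stream[i]);
        cufftDestroy(plan[i]);
        cudaFree(data[i]);
    }
    return 0;
}
```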
Before I got very far with the stand-alone program, valgrind was already reporting this:
[font=“Courier New”]==621== LEAK SUMMARY:
==621== definitely lost: 40,816 bytes in 512 blocks
==621== indirectly lost: 48,113 bytes in 796 blocks
==621== possibly lost: 10,011,782 bytes in 4,781 blocks
==621== still reachable: 266,003 bytes in 3,397 blocks[/font]
The largest possibly lost blocks that valgrind complains about are in cuModuleLoadFatBinary:
[font=“Courier New”]by 0x719D458: cuModuleLoadFatBinary (in /usr/lib64/libcuda.so.260.19.26)[/font]
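Allocations made inside the driver like this are outside the application’s control, so my current workaround is to hide them with a valgrind suppressions file rather than chase them. A minimal sketch (the suppression name is arbitrary, and the frame pattern is my guess at what matches):

```
{
   cuda-driver-leaks
   Memcheck:Leak
   fun:malloc
   ...
   obj:*libcuda.so*
}
```

Saved as e.g. cuda.supp and passed with valgrind --suppressions=cuda.supp.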
The program is essentially identical to the 1D Complex-to-Complex example in the CUFFT Library guide:
[font=“Courier New”]#include <cufft.h>

#define NX 256
#define BATCH 10

int main()
{
    cufftHandle plan;
    cufftComplex *data;

    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*BATCH);
    cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cufftExecC2C(plan, data, data, CUFFT_INVERSE);
    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}[/font]
The program is compiled as:
[font=“Courier New”]nvcc testleak.cu -o testleak.o -lcufft[/font]
Valgrind is run as:
[font=“Courier New”]valgrind -v --leak-check=full ./testleak.o[/font]
Valgrind version is 3.5.0.
I’m using CUDA 3.2 (driver version 260.19.26, toolkit version 3.2.16).
The program is running on Fedora 12 [font=“Courier New”](18.104.22.168-175.fc12.x86_64 #1 SMP Wed Dec 1 21:39:34 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux).[/font]
Is the CUFFT library not being unloaded from memory in time for valgrind to see that it has been freed?
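One thing I’m considering trying (my own idea, not something the CUFFT guide’s example does) is tearing down the CUDA context explicitly before main returns, so the driver’s allocations are released while valgrind can still observe the frees:

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

#define NX 256
#define BATCH 10

int main(void)
{
    cufftHandle plan;
    cufftComplex *data;

    cudaMalloc((void **)&data, sizeof(cufftComplex) * NX * BATCH);
    cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);

    cufftDestroy(plan);
    cudaFree(data);

    /* Explicitly destroy the context so driver-side allocations are
     * freed before process exit. In the CUDA 3.2 runtime API this is
     * cudaThreadExit(); later toolkits renamed it cudaDeviceReset(). */
    cudaThreadExit();
    return 0;
}
```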
In the case of the larger project, I don’t see these large memory leaks at all; however, all CUDA operations there are done in a child pthread that is joined and destroyed before the program fully exits. The only leak I get from that is:
[font=“Courier New”]==24165== 784 (8 direct, 776 indirect) bytes in 1 blocks are definitely lost in loss record 36 of 48
==24165== at 0x4A0515D: malloc (vg_replace_malloc.c:195)
==24165== by 0xACE6E27: ??? (in /usr/local/cuda/lib64/libcufft.so.3.2.16)
==24165== by 0xACE6D37: ??? (in /usr/local/cuda/lib64/libcufft.so.3.2.16)
==24165== by 0xACE6EEB: ??? (in /usr/local/cuda/lib64/libcufft.so.3.2.16)
==24165== by 0xACE74B1: ??? (in /usr/local/cuda/lib64/libcufft.so.3.2.16)
==24165== by 0xACE78BA: cufftPlan1d (in /usr/local/cuda/lib64/libcufft.so.3.2.16)
In the code that generates the above report, I call cufftDestroy(plan[i]) in a loop to destroy all the plans. However, one of the plans can sometimes be a different size than the others: all but the last plan might process 5 segments of data with 128-point FFTs, whereas the last one would get only 4.
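The teardown loop is essentially the following sketch; NPLANS, the 128-point size, and the batch counts (5 vs. 4) are illustrative placeholders, not the project’s real values, and the per-stream synchronize is something I’ve added defensively rather than anything the CUFFT guide requires:

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

#define NPLANS 8 /* placeholder */

/* Destroy every plan and its associated stream. The last plan may have
 * been created with a smaller batch than the rest (4 instead of 5). */
void destroy_plans(cufftHandle plan[NPLANS], cudaStream_t stream[NPLANS])
{
    for (int i = 0; i < NPLANS; i++) {
        cudaStreamSynchronize(stream[i]); /* ensure no FFT is in flight */
        cufftDestroy(plan[i]);
        cudaStreamDestroy(stream[i]);
    }
}
```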
Should I be concerned about the leak in the large project? Are the FFT plans being correctly destroyed when streams are involved?