On a large project that uses CUDA, I’m running valgrind to try to track down memory leaks. To make my life easier, I wrote a stand-alone program that reproduces the sequence of CUDA operations the large project performs:
- Allocate memory on the GPU
- Create a set of FFT plans
- Create a number of CUDA streams and assign them to the FFT plans via cufftSetStream
- Repeatedly perform FFT operations
- Destroy streams
- Destroy FFT plans
- Free FFT plan memory
- Free GPU memory
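For reference, the stand-alone program's workflow looks roughly like this. This is a minimal sketch, not the actual code: NSTREAMS, NX, BATCH, and the iteration count are placeholders I’ve picked for illustration.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

#define NX 128     /* points per FFT (placeholder) */
#define BATCH 5    /* FFTs per plan (placeholder) */
#define NSTREAMS 4 /* number of streams/plans (placeholder) */

int main(void)
{
    cufftComplex *data[NSTREAMS];
    cufftHandle plan[NSTREAMS];
    cudaStream_t stream[NSTREAMS];

    /* Allocate GPU memory, create plans and streams, bind them together. */
    for (int i = 0; i < NSTREAMS; i++) {
        cudaMalloc((void **)&data[i], sizeof(cufftComplex) * NX * BATCH);
        cufftPlan1d(&plan[i], NX, CUFFT_C2C, BATCH);
        cudaStreamCreate(&stream[i]);
        cufftSetStream(plan[i], stream[i]);
    }

    /* Repeatedly execute FFTs, each plan on its own stream. */
    for (int iter = 0; iter < 100; iter++)
        for (int i = 0; i < NSTREAMS; i++)
            cufftExecC2C(plan[i], data[i], data[i], CUFFT_FORWARD);
    cudaThreadSynchronize(); /* wait for all streams (CUDA 3.2 API) */

    /* Tear down in the same order as the list above. */
    for (int i = 0; i < NSTREAMS; i++) {
        cudaStreamDestroy(stream[i]);
        cufftDestroy(plan[i]);
        cudaFree(data[i]);
    }
    return 0;
}
```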
Before I got very far with the stand-alone program, valgrind was already reporting this:
[font=“Courier New”]==621== LEAK SUMMARY:
==621== definitely lost: 40,816 bytes in 512 blocks
==621== indirectly lost: 48,113 bytes in 796 blocks
==621== possibly lost: 10,011,782 bytes in 4,781 blocks
==621== still reachable: 266,003 bytes in 3,397 blocks[/font]
The largest possibly lost blocks that valgrind complains about are in cuModuleLoadFatBinary:
[font=“Courier New”]by 0x719D458: cuModuleLoadFatBinary (in /usr/lib64/libcuda.so.260.19.26)[/font]
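Allocations made inside the driver like this are outside the application’s control, so my current workaround is to hide them with a valgrind suppressions file rather than chase them. A minimal sketch (the suppression name is arbitrary, and the frame pattern is my guess at what matches):

```
{
   cuda-driver-leaks
   Memcheck:Leak
   fun:malloc
   ...
   obj:*libcuda.so*
}
```

Saved as e.g. cuda.supp and passed with valgrind --suppressions=cuda.supp.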
The program is essentially identical to the 1D Complex-to-Complex example in the CUFFT Library guide:
[font=“Courier New”]#include <cufft.h>

#define NX 256
#define BATCH 10

int main()
{
    cufftHandle plan;
    cufftComplex *data;

    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*BATCH);
    cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cufftExecC2C(plan, data, data, CUFFT_INVERSE);
    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}[/font]
The program is compiled as:
[font=“Courier New”]nvcc testleak.cu -o testleak.o -lcufft[/font]
Valgrind is run as:
[font=“Courier New”]valgrind -v --leak-check=full ./testleak.o[/font]
Valgrind version is 3.5.0.
I’m using CUDA 3.2 (driver version 260.19.26, toolkit version 3.2.16).
The program is running on Fedora 12 [font=“Courier New”](18.104.22.168-175.fc12.x86_64 #1 SMP Wed Dec 1 21:39:34 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux).[/font]
Is the CUFFT library not being unloaded from memory in time for valgrind to see that it has been freed?
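One thing I’m considering trying (my own idea, not something the CUFFT guide’s example does) is tearing down the CUDA context explicitly before main returns, so the driver’s allocations are released while valgrind can still observe the frees:

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

#define NX 256
#define BATCH 10

int main(void)
{
    cufftHandle plan;
    cufftComplex *data;

    cudaMalloc((void **)&data, sizeof(cufftComplex) * NX * BATCH);
    cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);

    cufftDestroy(plan);
    cudaFree(data);

    /* Explicitly destroy the context so driver-side allocations are
     * freed before process exit. In the CUDA 3.2 runtime API this is
     * cudaThreadExit(); later toolkits renamed it cudaDeviceReset(). */
    cudaThreadExit();
    return 0;
}
```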
In the case of the larger project, I don’t see these large memory leaks at all; however, all CUDA operations there are done in a child pthread that is joined and destroyed before the program fully exits. The only leak I get from that is:
[font=“Courier New”]==24165== 784 (8 direct, 776 indirect) bytes in 1 blocks are definitely lost in loss record 36 of 48
==24165== at 0x4A0515D: malloc (vg_replace_malloc.c:195)
==24165== by 0xACE6E27: ??? (in /usr/local/cuda/lib64/libcufft.so.3.2.16)
==24165== by 0xACE6D37: ??? (in /usr/local/cuda/lib64/libcufft.so.3.2.16)
==24165== by 0xACE6EEB: ??? (in /usr/local/cuda/lib64/libcufft.so.3.2.16)
==24165== by 0xACE74B1: ??? (in /usr/local/cuda/lib64/libcufft.so.3.2.16)
==24165== by 0xACE78BA: cufftPlan1d (in /usr/local/cuda/lib64/libcufft.so.3.2.16)
In the code that generates the above report, I call cufftDestroy(plan[i]) in a loop to destroy all the plans. However, one of the plans can sometimes be a different size than the others: all but the last plan might process 5 segments of data with 128-point FFTs, whereas the last one would get only 4.
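The teardown loop is essentially the following sketch; NPLANS, the 128-point size, and the batch counts (5 vs. 4) are illustrative placeholders, not the project’s real values, and the per-stream synchronize is something I’ve added defensively rather than anything the CUFFT guide requires:

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

#define NPLANS 8 /* placeholder */

/* Destroy every plan and its associated stream. The last plan may have
 * been created with a smaller batch than the rest (4 instead of 5). */
void destroy_plans(cufftHandle plan[NPLANS], cudaStream_t stream[NPLANS])
{
    for (int i = 0; i < NPLANS; i++) {
        cudaStreamSynchronize(stream[i]); /* ensure no FFT is in flight */
        cufftDestroy(plan[i]);
        cudaStreamDestroy(stream[i]);
    }
}
```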
Should I be concerned about the leak in the large project? Are the FFT plans being correctly destroyed when streams are involved?