Memory leak in cuFFT (cuda 5.0)?

I create a 1D FFT plan and call the transform (exec) function in a loop, on exactly the same memory each time. After a number of iterations, the exec calls return CUFFT_EXEC_FAILED and all of my subsequent CUDA calls fail.

I found that if I create and destroy the plan inside my loop (which adds about 700 µs of overhead per iteration), I do not crash. I can only guess that the exec calls are leaking memory…

Can anyone else confirm this?

Could you please paste your test case here?

I simplified my code down enough to post; there’s not much to it. It doesn’t seem to actually be a matter of when and where I create and destroy plans… I get inconsistent results either way.

Edit: I’m struggling to post the whole code because the forum seems to cut it off… Here is the main portion:
Code:
A 2^16 = 64K-point FFT, run in a loop of 10. My CUDA utils file should be the same as helper_cuda.h from the SDK samples.

	// Create GPU buffers
	checkCudaErrors( cudaMalloc((void**)&d_in,bytes_per_buffer) );

	// Move initialized data to GPU
	checkCudaErrors(cudaMemcpy(d_in, h_in, bytes_per_buffer, cudaMemcpyHostToDevice));
	cudaDeviceSynchronize();

	// Planning
	checkCudaErrors( cufftPlan1d( &plan, fft_n_elems, CUFFT_C2C, 1 ) );

	// testing
	cufftResult_t err;

	for ( int i=0; i<10; i++ )
	{
		err = cufftExecC2C(plan, d_in,d_in, CUFFT_FORWARD);
		cudaDeviceSynchronize();
		checkCudaErrors(err);
	}

	//Retrieve result,
	checkCudaErrors(cudaMemcpy(h_out, d_in, bytes_per_buffer, cudaMemcpyDeviceToHost)); // get result
	cudaDeviceSynchronize();

	printf("DONE\n");

	// tear down
	checkCudaErrors(cufftDestroy(plan));
	free(h_in);
	free(h_out);
	checkCudaErrors(cudaFree(d_in) );
	cudaDeviceReset();
	return 0;
}

Build:
System: Dell M6600 Notebook, i7 gen3, RedHat Enterprise Linux 6.1, Nvidia Quadro 3000M (driver 304.54 and CUDA 5.0)

LD_LIBRARY_PATH=/usr/local/cuda-5.0/lib64 PATH=/h/dyablons/bin:/usr/ucb:/usr/local/bin:/usr/lib64/qt-3.3/bin:/etc/profile.d:/usr/local/bin:/bin:/usr/bin //usr/local/cuda-5.0/bin/nvcc -arch=sm_21 -m64  -I/usr/local/cuda-5.0/include nvidia_fft_cuda_t.cpp -o nvidia_fft_cuda.lnx_rh61_x86_64_cuda5.0__t -L/usr/local/cuda-5.0/lib64 -lcufft -lcuda -lcudart

My results (several runs, some work, some crash, different number of failures…):

[root@pm]# ./nvidia_fft_cuda.lnx_rh61_x86_64_cuda5.0_t 
Details of this test
  FFT size chosen: 65536 (logn 16)
  max mem available: 2146631680
  1 buffers will have 524288 bytes each
  which is 1 distinct fft's performed
CUDA error at nvidia_fft_cuda_t.cpp:82 code=6(CUFFT_EXEC_FAILED) "err" 
CUDA error at nvidia_fft_cuda_t.cpp:82 code=6(CUFFT_EXEC_FAILED) "err" 
CUDA error at nvidia_fft_cuda_t.cpp:82 code=6(CUFFT_EXEC_FAILED) "err" 
CUDA error at nvidia_fft_cuda_t.cpp:82 code=6(CUFFT_EXEC_FAILED) "err" 
CUDA error at nvidia_fft_cuda_t.cpp:82 code=6(CUFFT_EXEC_FAILED) "err" 
CUDA error at nvidia_fft_cuda_t.cpp:82 code=6(CUFFT_EXEC_FAILED) "err" 
CUDA error at nvidia_fft_cuda_t.cpp:82 code=6(CUFFT_EXEC_FAILED) "err" 
CUDA error at nvidia_fft_cuda_t.cpp:86 code=4(cudaErrorLaunchFailure) "cudaMemcpy(h_out, d_in, bytes_per_buffer, cudaMemcpyDeviceToHost)" 
DONE
CUDA error at nvidia_fft_cuda_t.cpp:95 code=4(cudaErrorLaunchFailure) "cudaFree(d_in)" 
[root@pm]# ./nvidia_fft_cuda.lnx_rh61_x86_64_cuda5.0_t 
Details of this test
  FFT size chosen: 65536 (logn 16)
  max mem available: 2146631680
  1 buffers will have 524288 bytes each
  which is 1 distinct fft's performed
CUDA error at nvidia_fft_cuda_t.cpp:82 code=6(CUFFT_EXEC_FAILED) "err" 
CUDA error at nvidia_fft_cuda_t.cpp:86 code=4(cudaErrorLaunchFailure) "cudaMemcpy(h_out, d_in, bytes_per_buffer, cudaMemcpyDeviceToHost)" 
DONE
CUDA error at nvidia_fft_cuda_t.cpp:95 code=4(cudaErrorLaunchFailure) "cudaFree(d_in)" 
[root@pm]# ./nvidia_fft_cuda.lnx_rh61_x86_64_cuda5.0_t 
Details of this test
  FFT size chosen: 65536 (logn 16)
  max mem available: 2146631680
  1 buffers will have 524288 bytes each
  which is 1 distinct fft's performed
DONE
[root@pm-sb-mob-vmc scripts]#
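
As a sanity check on the sizes printed in the log, the buffer size follows from the point count times the size of one single-precision complex sample. The sketch below uses a local Complex32 struct as a stand-in for cufftComplex (an assumption made so it builds without the CUDA headers), and the function name is hypothetical.

```cpp
// Sketch: buffer size for one in-place single-precision C2C transform.
#include <cstddef>

// Stand-in for cufftComplex (two 32-bit floats). Defined locally as an
// assumption so this compiles without the CUDA headers.
struct Complex32 { float re; float im; };

// Bytes needed for one buffer of 2^log2_n complex points.
std::size_t c2c_buffer_bytes(unsigned log2_n)
{
    return (static_cast<std::size_t>(1) << log2_n) * sizeof(Complex32);
}
```

For logn 16 this gives 65536 * 8 = 524288 bytes, matching the log. That is a tiny fraction of the 2146631680 bytes reported as available, so the buffer itself is unlikely to be what exhausts memory.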

Help would be appreciated!

Sorry about the broken forum functionality. The two issues I am aware of and have reported are that everything after a “less than” character disappears (either the rest of the line or, sometimes, the rest of the entire post), and that backslashes (in Windows path names, in C escape sequences, in macro line continuations) disappear without a trace.

This makes the exact exchange of code (important for repro cases) nearly impossible in the forums at this time. If, after some due diligence, you suspect a bug in one of the CUDA components, I would suggest filing a bug through the registered developer website, attaching self-contained repro code.

Thanks for the explanation on the forum functionality.

I built another machine just to test this code on something different and I don’t seem to have the crashes. I suspect it’s a driver/cuda install bug or some very specific system configuration issue. If I can get around to it, I will report the resolution. For now, I’ll continue with my working system.

Thanks all.

Bump… I’m still getting errors now on both systems.

I modified the example code from the SDK, just commented out everything but the FFT and changed it to 1 megapoint. I also removed the padding since I’m not going through with the convolution…

I’m still concerned that the issue is my RedHat (6.1) setup with CUDA 5.0.

Here is the code. You should be able to overwrite the simpleCUFFT.cu file in your SDK with this file in its entirety, then just run the original makefile.

Please let me know if it looks like I’m doing anything wrong here.

https://docs.google.com/document/d/1QbUqlk0WGq8Ey5JSs0NVeJ7jscoVJ0JIaQttcqr9Aiw/edit

I am not a CUFFT user, and it is unlikely I could spot any API usage issues in your code. Is RedHat 6.1 on the list of supported platforms for CUDA 5.0?

If you are on a supported platform and you believe (after careful checking of your code) that there is a problem with the CUFFT library (such as a memory leak), please consider filing a bug report through the registered developer website. Please attach a self-contained repro app to the bug report.

Yes, 6.x is supported. I’m waiting for my registered developer registration to go through so I can submit a bug.

I would still appreciate it if someone has a moment to drop the file linked above into their SDK example (simpleCUFFT in 7_CUDALibraries); it would reassure me a bit.

Thanks