cufft error (?)

Hello everybody!

I ran into the following problem:

Here is the code

For dimensions Nxm=Nym=Nzm<=511 everything works fine.

For dimensions Nxm=Nym=Nzm=512, cufftExecC2C returns CUFFT_EXEC_FAILED.

For dimensions Nxm=Nym=Nzm=513 everything works fine again.

CUDA 4.0, Tesla C2050

On a Tesla C2070 everything works fine for me.

Any ideas?

I have the same 512x512x512 error with CUDA 4.0 on a GTX 560 Ti, and probably on a GTX 285 as well.

I have the same code and definitely no idea of the origin of the problem…

The size is too large. Are you sure it really works for Nxm=513? I was only able to go up to 400x400x400 doubles. Please check the free memory before the cudaMalloc, after the cudaMalloc, and after creating the plan.
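The three checks suggested above can be sketched roughly as follows. This is only an illustration, not the original poster's code: the helper names (c2c_bytes, report_free, probe_cube) are made up for this example, and it assumes the plan's work area is allocated when cufftPlan3d is called, which may differ between cuFFT versions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: bytes for an nx*ny*nz single-precision complex volume.
static size_t c2c_bytes(size_t nx, size_t ny, size_t nz) {
    return nx * ny * nz * sizeof(cufftComplex);
}

// Hypothetical helper: print free/total device memory, return free bytes (0 on error).
static size_t report_free(const char *label) {
    size_t free_b = 0, total_b = 0;
    cudaError_t err = cudaMemGetInfo(&free_b, &total_b);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 0;
    }
    printf("%-20s free %8.1f MiB of %8.1f MiB\n", label,
           free_b / 1048576.0, total_b / 1048576.0);
    return free_b;
}

// Allocate an n^3 volume, plan a 3D C2C transform, run it once in place,
// and report free memory at each step. Returns 0 on success.
static int probe_cube(int n) {
    cufftComplex *d_data = NULL;
    cufftHandle plan;
    size_t bytes = c2c_bytes(n, n, n);

    report_free("before cudaMalloc");
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc of %.1f MiB failed\n", bytes / 1048576.0);
        return 1;
    }
    cudaMemset(d_data, 0, bytes);  // defined input; the values do not matter here
    size_t before_plan = report_free("after cudaMalloc");

    cufftResult r = cufftPlan3d(&plan, n, n, n, CUFFT_C2C);
    if (r != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftPlan3d failed with code %d\n", (int)r);
        cudaFree(d_data);
        return 2;
    }
    size_t after_plan = report_free("after cufftPlan3d");
    // Difference in free memory approximates what the plan reserved.
    printf("plan reserved about %.1f MiB\n",
           ((double)before_plan - (double)after_plan) / 1048576.0);

    r = cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);
    cudaError_t sync = cudaDeviceSynchronize();  // make sure the transform really ran
    if (r != CUFFT_SUCCESS || sync != cudaSuccess)
        fprintf(stderr, "exec failed: cufft code %d, cuda: %s\n",
                (int)r, cudaGetErrorString(sync));

    cufftDestroy(plan);
    cudaFree(d_data);
    return (r == CUFFT_SUCCESS && sync == cudaSuccess) ? 0 : 3;
}
```

Calling probe_cube(511), probe_cube(512) and probe_cube(513) in turn from main should show whether the 512 case reserves noticeably more memory at plan time. For reference, the data alone for 512^3 cufftComplex is 1 GiB (c2c_bytes(512, 512, 512) = 1073741824).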

cuFFT uses different algorithms for a power-of-2 size (512) and for sizes like 511 or 513. The power-of-2 algorithm probably needs more memory.

My code may be different, but as I understand it, C2C is for single-precision (32-bit float) data.

My GTX 560 Ti and GTX 285 have 2 GB of RAM, which is apparently enough.
I can run computations on 508^3, and I can allocate a cufftComplex volume of 508^3 plus a single-float volume of 508x508x350, so I expected to be able to run a single 512^3 FFT.

But, as LF said, if cuFFT requires more memory for the faster 512 computation, I may be hitting the memory ceiling…
I knew FFT computations are much faster for sizes of 2^n, but I did not realize it would come at the expense of significantly more memory.

I brought my code to a Tesla C2050 a few minutes ago… It has 3 GB, runs headless (another graphics adapter is installed for the display), uses CUDA 4.1 and gcc 4.5, and my code compiles and runs fine without modification.

Some figures:

GTX 560 ti:
508^3 single float CUFFT_INVERSE: 2.3s
512^3: unknown

TESLA C2050:
508^3: 0.7s
512^3: 0.12s

Tesla rocks…

Really check how much memory is used and check that the transform is really done.


I have intermittent access to the PC with the Tesla C2050: I just managed to run two backward single-float transforms, 508^3 and 512^3, and check the resulting volumes (reconstructed image of a cell). I confirm that the reconstruction is correct and that the timings are correct. There may be errors compared to single-float fftw, but they are not obvious to the eye.

I did not (yet) use the NVIDIA timer, but I ran a series of 200 in-place transforms, which confirms exactly the aforementioned computing times. I also noted that the Tesla card was working in “adaptive” mode and not “performance” mode: power consumption at the plug was about 300 W for the whole “big” Xeon PC, GPU core temperature was below 65°C, and the Tesla fan was barely audible.

Please note that the timings only include the in-place Fourier computation on data already in GPU memory: transfers and plan creation are excluded.
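For anyone who wants to reproduce this kind of measurement, here is a minimal sketch using CUDA events. It is not the code used for the figures above: the function name time_inplace_c2c is hypothetical, and it assumes the plan already exists and the data is already on the device, so that transfers and plan creation stay outside the timed region.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: average milliseconds per in-place inverse C2C transform
// over `reps` runs. Returns a negative value on invalid input or on failure.
// Note: repeated unnormalized transforms on the same buffer let the values
// grow, so use this for timing only, not for checking results.
static float time_inplace_c2c(cufftHandle plan, cufftComplex *d_data, int reps) {
    if (reps <= 0 || d_data == NULL) return -1.0f;

    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    bool ok = true;
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps && ok; ++i) {
        // In-place: the same device pointer is input and output.
        ok = (cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE) == CUFFT_SUCCESS);
    }
    cudaEventRecord(stop, 0);

    // Wait until all queued transforms have finished before reading the timer.
    if (cudaEventSynchronize(stop) != cudaSuccess) ok = false;

    float total_ms = 0.0f;
    if (ok && cudaEventElapsedTime(&total_ms, start, stop) != cudaSuccess) ok = false;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ok ? total_ms / reps : -1.0f;
}
```

With a plan from cufftPlan3d(&plan, 512, 512, 512, CUFFT_C2C) and a device buffer d_data, time_inplace_c2c(plan, d_data, 200) would correspond to the series of 200 transforms mentioned above. A negative return value means a cuFFT or CUDA call failed, which is worth checking before trusting any timing.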

Next time I have access to the PC, I’ll check the amount of memory required by cuFFT, and I’ll also check how much memory I can reserve during the transform. I am considering having my department buy one for me, so I need to be sure of my requirements.


I forgot to mention that I was using CUDA 4.0 / gcc 4.4 / Debian stable.
I switched to CUDA 4.1 / latest dev drivers / gcc 4.5 / Debian testing, and now I can finally compute my Fourier transform in 512x512x512 single float.

It takes… 0.3s: really fast!!
In comparison, fftw takes 0.72 seconds in exhaustive wisdom mode on 3 cores of a Core i7 2600K with fftw 3.3.0.

(edited: was 1.60s with fftw 3.2.2)