The size is too large. Are you sure it really works for Nxm=513? I was only able to go up to 400x400x400 doubles. Please check the free memory before the cudaMalloc, after the cudaMalloc, and after creating the plan.
My code may be different, but I understood C2C to be the single-precision (32-bit float) transform.
My GTX 560 Ti and 285 have 2 GB of RAM, which is apparently enough.
I can run computations on 508^3, and I can allocate a cufftComplex volume of 508^3 plus a single-float volume of 508x508x350, so I expected to be able to run a single 512^3 FFT.
But, as LF said, if cuFFT requires more memory for the faster 512 computation, I may hit the memory ceiling…
I knew FFT computations are much faster for sizes of 2^n, but I did not expect that to come at the expense of significantly more memory.
I took my code to a Tesla C2050 a few minutes ago… It has 3 GB, runs headless (another graphics adapter is installed for the display), and uses CUDA 4.1 and gcc 4.5. My code compiles and runs fine there without modification.
Some figures:
GTX 560 Ti:
508^3 single float CUFFT_INVERSE: 2.3 s
512^3: unknown
I have intermittent access to the PC with the Tesla C2050. I just managed to run two backward single-float transforms, 508^3 and 512^3, and to check the resulting volumes (the reconstructed image of a cell). I confirm that the reconstruction is correct and that the timings are correct. There may be errors compared to single-float FFTW, but they are not obvious to the eye.
I have not (yet) used the NVIDIA timer, but I ran a series of 200 in-place transforms, which confirms the aforementioned computing times exactly. I also noted that the Tesla card was working in “adaptive” mode and not “performance” mode: power consumption at the plug was about 300 W for the whole “big” Xeon PC, the GPU core temperature was below 65 °C, and the Tesla fan was barely audible.
Please note that the timings include only the in-place Fourier computation on data already in GPU memory: transfers and plan creation are excluded.
Next time I have access to the PC, I'll check the amount of memory required by cuFFT, and I'll also check how much memory I can reserve during the transform. I am considering asking my department to buy one for me, so I need to be sure of my requirements.
I forgot to mention that I was using CUDA 4.0 / gcc 4.4 / Debian stable.
I switched to CUDA 4.1 / latest dev drivers / gcc 4.5 / Debian testing, and now I can finally compute my 512x512x512 single-float Fourier transform.
It takes… 0.3 s: really fast!!
In comparison, FFTW 3.3.0 takes 0.72 seconds in exhaustive wisdom mode on 3 cores of a Core i7-2600K.