What about FFTW? Is it faster than MKL?
I have a code where only power-of-2 2D real-to-complex (and inverse) transforms are needed. Is CUFFT the best choice for CUDA computation in this situation? I have to transform several slices, all with the same number of points (64x64, 128x128, or 256x256, depending on the resolution I need), and I found that FFTW performs better than CUFFT at those sizes. Is that right, or am I doing something wrong?

I think you are right. You need large matrices to see a speed-up. If you have to transform arrays of the same size that are independent of each other, you can try grouping them into one call using the batch option.
I had one program that solved a partial differential equation by an iterative process with very little communication, and in that case I saw a reasonable speed-up over FFTW for small sizes.

Are these timings for 1000 iterations, or have you already divided by 1000?

If you didn’t divide, then it is expected: you get ~20+ microseconds per FFT. Note that one 2D FFT is likely implemented using multiple kernel invocations, and each of them costs at least 5 microseconds.

The problem is that your FFT sizes are too small and GPUs are not particularly efficient when it comes to solving small problems.
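The estimate above can be written out as a small calculation. This is only a sketch: the helper names are made up for illustration, and the number of kernels per 2D FFT is an assumption, since it depends on the transform size and on how the library decomposes it.

```cpp
// Average time per transform in microseconds, given the total wall time
// in milliseconds for a timing loop. Function names are illustrative only.
inline double per_fft_us(double total_ms, int iterations, int ffts_per_iteration) {
    return total_ms * 1000.0 /
           (static_cast<double>(iterations) * ffts_per_iteration);
}

// Lower bound on the time of one FFT from kernel launch overhead alone.
// kernels_per_fft is an assumed value; launch_us is the per-launch cost
// (the post above quotes at least 5 microseconds).
inline double launch_floor_us(int kernels_per_fft, double launch_us) {
    return kernels_per_fft * launch_us;
}
```

For example, if 1000 iterations of one FFT each took 20 ms in total, per_fft_us(20.0, 1000, 1) gives 20 microseconds per FFT. With an assumed 4 kernel launches at 5 microseconds each, launch_floor_us(4, 5.0) is also 20 microseconds, so at that size the launch overhead alone would account for the measured time.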

@ pasoleatis:
My program solves partial differential equations in 3 dimensions with periodic boundary conditions in 2 of them, so I have to perform NZ 2D FFTs with NX x NY points each. I will try to group them (cufftPlanMany(), right?). Thanks for the suggestion!
Anyway, in your experience, have you found better performance with iterative methods than with FFTs on GPUs?
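For the NZ slices of NX x NY points described above, the part that is easy to get wrong when batching is the data layout: a real-to-complex transform stores only NY/2+1 complex values along the contiguous dimension. The sketch below is plain C++ that computes the sizes one would pass to cufftPlanMany(). The struct and helper names are hypothetical, and it assumes single precision with the slices stored contiguously one after another.

```cpp
#include <cstddef>

// Layout of a batch of 2D real-to-complex transforms, one per z-slice.
// This struct is illustrative only; it is not part of cuFFT.
struct BatchedR2CLayout {
    int n[2];   // size of each transform: {NX, NY}, NY contiguous in memory
    int batch;  // number of independent slices (NZ)
    int idist;  // real elements between the starts of consecutive input slices
    int odist;  // complex elements between the starts of consecutive output slices
};

inline BatchedR2CLayout make_layout(int nx, int ny, int nz) {
    BatchedR2CLayout l;
    l.n[0] = nx;
    l.n[1] = ny;
    l.batch = nz;
    l.idist = nx * ny;            // full real slice
    l.odist = nx * (ny / 2 + 1);  // half spectrum along the contiguous dimension
    return l;
}

// Bytes to allocate for the real input (single precision assumed).
inline std::size_t input_bytes(const BatchedR2CLayout& l) {
    return static_cast<std::size_t>(l.batch) * l.idist * sizeof(float);
}

// Bytes to allocate for the complex output (two floats per element).
inline std::size_t output_bytes(const BatchedR2CLayout& l) {
    return static_cast<std::size_t>(l.batch) * l.odist * 2 * sizeof(float);
}

// With cuFFT available, the plan would then be created roughly as:
//   cufftPlanMany(&plan, 2, l.n, NULL, 1, l.idist,
//                 NULL, 1, l.odist, CUFFT_R2C, l.batch);
// Passing NULL for inembed/onembed selects the default contiguous layout,
// which matches the idist/odist computed here.
```

The plan is created once and reused at every time step, so a single cufftExecR2C() call transforms all NZ slices instead of issuing NZ separate small transforms.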

In my code, at each time step I have 2 forward FFTs and one inverse FFT of 2D matrices of size lx by ly, with some simple operations in between (like a_ij + b_ij). I only ran my codes on our supercomputer, which had AMD Opterons at 2.2 GHz, and only as single-core versions using FFTW (I was too lazy to put in MKL). I got a relative speed-up between 20 and 60, depending on lx and ly.