CUFFT: calculation time


I have tested the speedup of the CUFFT library in comparison with the MKL library.

Everybody measures only GFLOPS, but I need the real calculation time.

(I use the PGI CUDA Fortran compiler ver. 11.8 on a Tesla C2050 with CUDA 4.0.)

I measure the time as follows (without data transfer to/from the GPU, i.e. only the calculation time):

err = cudaEventRecord(tstart, 0)

do ntimes = 1, Nt
   ! D2Z is forward-only, so cufftExecD2Z takes no direction argument
   call cufftExecD2Z(planRe, aRe_dev, aCo_dev_out)
end do

err = cudaEventRecord(tstop, 0)
err = cudaEventSynchronize(tstop)
err = cudaEventElapsedTime(start_stop_time, tstart, tstop)

So I run Nt = 1000 iterations to get a reasonable total time.

I have tested a 2D double-to-complex FFT (D2Z) for different sizes (time in milliseconds):

Size        MKL, ms    CUFFT, ms    Speedup
128 x 128       112           29       3.86
256 x 256       527           62       8.5
512 x 512      2307          225      10.25

and obtained some unexpected results:

Size        MKL, ms    CUFFT, ms    Speedup
128 x 32         26           27       1
32 x 128         26           23       1
256 x 32         50           28       1.8
32 x 256         53           23       2.3

For these sizes I didn't obtain any real speedup. Is it really so, or am I doing something wrong?

Any ideas?


Probably not. Try sizes over 1024 x 1024.

What about FFTW? Is it faster than MKL?
I have a code where only power-of-two 2D real-to-complex (and inverse) transforms are needed. Is CUFFT the best choice for CUDA in this situation? I have to transform several slices, all with the same number of points (64x64, 128x128 or 256x256, depending on the resolution I need), and I found that FFTW performs better than CUFFT at those dimensions. Is that right, or am I doing something wrong?

I think you are right. You need large matrices to see a speedup. If you have to do transforms of arrays of the same size, independent of each other, you can try to group them into one call using the batch option.
I had a program for solving a partial differential equation by an iterative process with very little communication, and in that case I saw reasonable speedup compared to FFTW even for small sizes.

Are these timings for 1000 iterations, or have you already divided by 1000?

If you didn't divide, then it is expected: you get ~20+ microseconds per FFT. Note that one 2D FFT is likely implemented as multiple kernel invocations, and each launch costs at least 5 microseconds.

The problem is that your FFT sizes are too small, and GPUs are not particularly efficient at solving small problems.

@pasoleatis:
My program solves partial differential equations in 3 dimensions with periodic conditions in 2 dimensions, so I have to perform NZ 2D FFTs of NX x NY points each. I will try to group them (cufftPlanMany(), right?). Thanks for the suggestion!
Anyway, in your experience, have you found better performance with iterative methods than with FFT-based ones on GPUs?
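For reference, batching the NZ same-size slices with cufftPlanMany() would look roughly like this in C. This is only a sketch (it needs the CUDA toolkit to compile, error handling is elided, and the dimension ordering is an assumption to verify against your actual array storage):

```c
#include <cufft.h>

/* Batch NZ independent 2D D2Z transforms of size NX x NY in one plan.
   With inembed/onembed set to NULL, cuFFT ignores the advanced layout
   parameters and assumes densely packed, contiguous batches. */
cufftHandle make_batched_plan(int nx, int ny, int nz) {
    cufftHandle plan;
    /* n[0] is the slowest-varying dimension; swap if your storage differs */
    int n[2] = { ny, nx };
    cufftResult err = cufftPlanMany(&plan, 2, n,
                                    NULL, 1, 0,   /* input: default layout  */
                                    NULL, 1, 0,   /* output: default layout */
                                    CUFFT_D2Z, nz);
    /* check err == CUFFT_SUCCESS before using the plan */
    (void)err;
    return plan;
}
```

A single cufftExecD2Z(plan, in_dev, out_dev) then transforms all NZ slices in one call, amortizing the per-launch overhead discussed above.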


In my code, at each time step, I have two forward FFTs and one inverse FFT of 2D matrices of size lx by ly, plus some simple element-wise operations in between (like a_ij + b_ij). I only ran my codes on our supercomputer, which had AMD Opterons at 2.2 GHz, and only single-core versions using FFTW (I was too lazy to put in MKL). I got a relative speedup between 20 and 60, depending on lx and ly.