CUFFT: calculation time

esem · December 9, 2011, 4:24pm

Hi,

I have tested the speedup of the CUFFT library against the MKL library.

Everybody measures only GFLOPS, but I need the real calculation time.

(I use the PGI CUDA Fortran compiler ver.11.8 on Tesla C2050 and CUDA 4.0)

I measure the time as follows (without data transfer to/from the GPU, i.e. only the calculation time):

    err = cudaEventRecord ( tstart, 0 )
    do ntimes = 1,Nt
       call cufftExecD2Z(planRe, aRe_dev, aCo_dev_out, -1)
    enddo
    err = cudaEventRecord ( tstop, 0 )
    err = cudaEventSynchronize ( tstop )
    err = cudaEventElapsedTime ( start_stop_time, tstart, tstop )

So I run 1000 iterations to get a measurable time.

I have tested the 2D double-precision real-to-complex FFT (D2Z) for different sizes (times in milliseconds):

Size          MKL, ms    CUFFT, ms    Speedup
128 x 128     112        29           3.86
256 x 256     527        62           8.5
512 x 512     2307       225          10.25

and obtained some unexpected results:

Size          MKL, ms    CUFFT, ms    Speedup
128 x 32      26         27           1
32 x 128      26         23           1
256 x 32      50         28           1.8
32 x 256      53         23           2.3

For these sizes I didn’t obtain any real speedup. Is that really the case, or am I doing something wrong?

Any ideas?

Regards,

pasoleatis · December 9, 2011, 11:58pm

Probably not. Try sizes over 1024x1024.

Franz86 · April 19, 2012, 2:51pm

What about FFTW? Is it faster than MKL?
I have a code where only power-of-2 2D real-to-complex (and inverse) transforms are needed. Is CUFFT the best choice for CUDA in this situation? I had to transform several slices, all with the same number of points (64x64, 128x128 or 256x256, depending on the resolution I need), and I found that FFTW performs better than CUFFT at those sizes. Is that right, or am I doing something wrong?

pasoleatis · April 19, 2012, 8:13pm

I think you are right. You need large matrices to see a speed-up. If you have to transform arrays of the same size that are independent of each other, you can try to group them into one call using the batch option.
I had one program that solved a partial differential equation by an iterative process with very little communication, and in that case I saw a reasonable speed-up for small sizes compared to FFTW.

vvolkov · April 20, 2012, 7:29am

Are these timings given for 1000 iterations, or have you already divided by 1000?

If you didn’t divide, then it is expected: you get ~20+ microseconds per FFT. Note that one 2D FFT is likely implemented using multiple kernel invocations, and each of them costs at least 5 microseconds.

The problem is that your FFT sizes are too small and GPUs are not particularly efficient when it comes to solving small problems.
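The arithmetic behind that estimate can be written out as a small sketch. It assumes the totals in the first post are for all 1000 iterations (the question above is still open), and it takes the 5 microsecond per-kernel figure from this reply at face value. The function names are made up for illustration.

```python
# Totals from the first post: (MKL ms, CUFFT ms), assumed to cover all
# 1000 iterations, i.e. not yet divided by the iteration count.
N_ITERATIONS = 1000
LAUNCH_FLOOR_US = 5.0  # per-kernel cost quoted in the reply above

totals_ms = {
    "128 x 32": (26, 27),
    "32 x 128": (26, 23),
    "256 x 32": (50, 28),
    "32 x 256": (53, 23),
    "512 x 512": (2307, 225),
}


def per_transform_us(total_ms, iterations=N_ITERATIONS):
    """Average time of a single transform, in microseconds."""
    return total_ms * 1000.0 / iterations


def launch_equivalents(time_us, floor_us=LAUNCH_FLOOR_US):
    """How many minimum-cost kernel launches fit into time_us."""
    return time_us / floor_us


if __name__ == "__main__":
    for size, (mkl_ms, cufft_ms) in totals_ms.items():
        gpu_us = per_transform_us(cufft_ms)
        cpu_us = per_transform_us(mkl_ms)
        print(f"{size:>10}: MKL {cpu_us:7.1f} us, CUFFT {gpu_us:6.1f} us, "
              f"about {launch_equivalents(gpu_us):.1f} launch floors")
```

Under these assumptions the four small sizes all land between 23 and 28 microseconds on the GPU, only a handful of launch floors each, while the MKL time still scales with the problem size. That pattern is what a fixed per-call cost would look like.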

Franz86 · April 21, 2012, 12:18pm

@pasoleatis:
My program solves partial differential equations in 3 dimensions with periodic conditions in 2 of them, so I have to perform NZ 2D FFTs with NX x NY points each. I will try to group them (cufftPlanMany(), right?). Thanks for the suggestion!
Anyway, in your experience, have you found better performance on GPUs with iterative methods than with FFTs?
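For the batched case described here, a rough sketch of the layout values that a cufftPlanMany() call would need for NZ contiguous NX x NY real-to-complex slices. It only does the index arithmetic and does not call CUFFT. It assumes a Fortran array a(NX, NY, NZ), so NX is the fastest-varying dimension, an out-of-place transform, and tightly packed slices; the helper name is hypothetical and the values should be checked against the CUFFT documentation before use.

```python
def batched_d2z_layout(nx, ny, nz):
    """Layout values for nz contiguous 2D real-to-complex transforms.

    Assumes Fortran column-major storage a(nx, ny, nz): nx varies fastest,
    so CUFFT (which lists the slowest dimension first) sees n = (ny, nx).
    """
    complex_fast = nx // 2 + 1  # R2C keeps only nx/2 + 1 outputs along the fastest dimension
    return {
        "rank": 2,
        "n": (ny, nx),               # slowest dimension first
        "istride": 1,                # elements within a slice are adjacent
        "idist": nx * ny,            # real elements between consecutive input slices
        "ostride": 1,
        "odist": complex_fast * ny,  # complex elements between consecutive output slices
        "batch": nz,                 # one transform per z-slice
    }


if __name__ == "__main__":
    # Slice sizes mentioned earlier in the thread; nz = 32 is an arbitrary example.
    for n in (64, 128, 256):
        print(n, batched_d2z_layout(n, n, 32))
```

The point of the batch is that all NZ slices go through a single plan execution instead of NZ separate calls, so any fixed per-call cost is paid once per batch rather than once per slice.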

pasoleatis · April 21, 2012, 7:51pm

Hello,

In my code I have, at each time step, 2 forward FFTs and one inverse FFT of 2D matrices of size lx by ly, with some simple operations in between (like a_ij+b_ij). I only ran my codes on our supercomputer, which had AMD Opterons at 2.2 GHz, and only single-core versions using FFTW (I was too lazy to put in MKL). I got a relative speed-up between 20 and 60, depending on lx and ly.
