Does cufft show much higher efficiency than cpu fft routines?

I’ve tested cufft from CUDA 2.3 and CUDA 3.0. Compared with the FFT routines from MKL, cufft shows almost no speed advantage.
I’m just about to test CUDA 3.1. Hopefully the results will be better.

My understanding is that the Intel MKL FFTs are based on FFTW (Fastest Fourier transform in the West) from MIT. Benchmarking CUFFT against FFTW, I get speedups from 50- to 150-fold, when using CUFFT for 3D FFTs. Maybe you could provide some more details on your benchmarks. Single 1D FFTs might not be that much faster, unless you do many of them in a batch.

I personally have not used the CUFFT code, but based on previous threads, the most common reason for seeing poor performance compared to a well-tuned CPU is the size of the FFT. Small FFTs underutilize the GPU and are dominated by the time required to transfer the data to/from the GPU.
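To put a rough number on the transfer-bound argument, here is a back-of-the-envelope sketch in Python. It uses the conventional 5*n*log2(n) flop estimate for a complex FFT of length n and assumes single-precision complex data (8 bytes per element) that is copied to the GPU once and back once. The 5 GB/s bus bandwidth is an assumed placeholder, not a measurement from this thread; substitute whatever your own system achieves.

```python
import math

def fft_flops(n):
    """Conventional flop estimate for one complex FFT of length n."""
    return 5.0 * n * math.log2(n)

def transfer_bound_gflops(n, bytes_per_elem=8, bus_gb_per_s=5.0):
    """Upper bound on end-to-end GFLOPS when every element crosses the
    bus twice (host -> device, then device -> host).

    bytes_per_elem=8 assumes single-precision complex data.
    bus_gb_per_s=5.0 is an assumed figure, not a measured one.
    """
    bytes_moved = 2 * n * bytes_per_elem         # one copy in, one copy out
    flops_per_byte = fft_flops(n) / bytes_moved
    return flops_per_byte * bus_gb_per_s         # (GB/s) * (flop/byte) = GFLOPS

for n in (32, 128, 512, 1024, 8192):
    print(f"n={n:5d}  cap = {transfer_bound_gflops(n):5.2f} GFLOPS")
```

Under these assumptions the cap works out to about 14 GFLOPS for n = 512. It does not depend on the batch size and grows only with log2(n), so in this simple model batching more short transforms does not help as long as every batch is still copied in and out.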

Pure kernel speedup is usually significant. If you can restructure your problem so that many modules (not just the FFTs) run on the GPU, your overall speedup will probably be much better. Right now the host <-> device communication is probably what’s killing you.
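A toy model of why moving more stages onto the GPU helps: the data is transferred once, then several processing stages run on the device before the result is copied back. All timings below are hypothetical values chosen for illustration, not measurements from this thread.

```python
def overall_speedup(stages, cpu_stage_s, gpu_stage_s, transfer_s):
    """Speedup of a GPU pipeline over a CPU pipeline.

    stages      -- number of processing steps run back to back on the data
    cpu_stage_s -- time per stage on the CPU (seconds)
    gpu_stage_s -- time per stage on the GPU (seconds)
    transfer_s  -- total host <-> device copy time, paid once per pipeline
    """
    cpu_total = stages * cpu_stage_s
    gpu_total = transfer_s + stages * gpu_stage_s
    return cpu_total / gpu_total

# Hypothetical numbers: each kernel is 20x faster than its CPU version,
# but the copies cost as much as one full CPU stage.
for k in (1, 2, 5, 20):
    s = overall_speedup(k, cpu_stage_s=0.10, gpu_stage_s=0.005, transfer_s=0.10)
    print(f"{k:2d} stage(s) on the GPU: {s:.2f}x overall")
```

With these made-up inputs a single stage on the GPU is slightly slower than the CPU (about 0.95x), while five stages reach 4x and twenty stages reach 10x, because the fixed copy cost is spread over more work.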

I did 1D FFTs in batches. I tested lengths from 32 to 1024 with different batch sizes. In all cases cufft was not faster. The following is one of the results:

n=1024 batch=1000

MKL: run 1.4998019E-02 sec

CUFFT: run 1.6996980E-02 sec

I suspect that cufft has no advantage for 1D FFTs.

In fact, I want to replace some of the FFT routines in a large program with GPU FFT routines. The length of my FFTs will be no larger than 512, but they can be done in batches. Does that mean cufft won’t beat FFTW at such sizes? If so, is there any other GPU FFT package that achieves higher efficiency?

I recently saw some performance comparisons between cufft and Intel MKL here: The Official NVIDIA Forums | NVIDIA (see slide number 4).

But it doesn’t seem to mention the batch size.

From the link it seems that cufft 3.1 on a Tesla C1060 achieves about twice the double-precision GFLOPS of MKL. My hardware is a GeForce GTX 285 with an Intel Core 2 Duo E7500 at 2.93 GHz, but I didn’t get results similar to those in the linked file. cufft 3.1 is no faster! Is this abnormal? I suspect there is some problem with my code…

Well, for one, I suspect you are comparing memcpy time + kernel time against the MKL FFT kernel time alone (apples and oranges?), and the memcpy time might be what’s eating a lot of the time here. So maybe you can run the CUDA Visual Profiler, get a detailed look at the timings, and then post them here?

I’m testing on a Linux cluster with no GUI, so the visual profiler is not available. I do the timing with the etime() function in my Fortran code. I think you are right that I’m comparing memcpy time + kernel time with the MKL FFT kernel time. Because my aim is to reduce the total run time of my program, I took it for granted that the memcpy time should be included in the estimate.

Here are some of the results (times in seconds):

n=4096, batch=2048
         user        sys             run
MKL:     0.1379790   1.0000169E-03   0.1389791
CUFFT:   0.1229820   5.9989989E-03   0.1289810

n=128, batch=65536
         user        sys             run
MKL:     0.1099830   0.000000        0.1099830
CUFFT:   0.1169820   4.9990118E-03   0.1219810

n=8192, batch=1024
         user        sys             run
MKL:     0.1539760   0.000000        0.1539760
CUFFT:   0.1239810   5.9989989E-03   0.1299801

In each case I chose n and batch so that n*batch = 2^23.
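For reference, the posted run times can be turned into effective GFLOPS with a few lines of Python. This is only a sketch: it assumes the conventional 5*n*log2(n) flop estimate per complex FFT, which is a common benchmarking convention and not something stated in the thread.

```python
import math

# (n, batch, MKL run time, CUFFT run time), times in seconds as posted.
results = [
    (4096, 2048, 0.1389791, 0.1289810),
    (128, 65536, 0.1099830, 0.1219810),
    (8192, 1024, 0.1539760, 0.1299801),
]

def gflops(n, batch, seconds):
    """Effective rate, assuming 5*n*log2(n) flops per length-n FFT."""
    return 5.0 * n * math.log2(n) * batch / seconds / 1e9

for n, batch, t_mkl, t_cufft in results:
    print(f"n={n:5d} batch={batch:6d}  "
          f"MKL {gflops(n, batch, t_mkl):5.2f} GFLOPS  "
          f"CUFFT {gflops(n, batch, t_cufft):5.2f} GFLOPS  "
          f"MKL/CUFFT time ratio {t_mkl / t_cufft:4.2f}")
```

All six figures land between roughly 2.4 and 4.2 GFLOPS, and the time ratio stays between about 0.9 and 1.2. Rates this low on the GPU side would be consistent with the timings being dominated by the copies, as suggested above, though only profiling can confirm it.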

P.S. Has anyone ever used GPUFFTW?

Just use:
export CUDA_PROFILE=1

Are you using pinned memory for the arrays on the host?