FFT Slower with CUDA

I am trying to get some of the speedup others have reported with the CUDA FFT calls. My application requires a 1D FFT of a matrix of data that is usually either 8192x4096 or 1024x1024, depending on the application. I used the FFT code from http://developer.nvidia.com/object/matlab_cuda.html, an example that many have tried. I took the 2D FFT and changed it to a 1D FFT call, with no other changes. Timing with tic and toc in MATLAB, I have found that the CUDA FFT is slower for my application. I am using a GTX 285 with 2 GB of RAM, but the host CPU is an Intel i7 975, which is tough to beat. I am using MATLAB 2009a, and its FFT runs across all 8 cores on the i7 chip.

I have noticed there are some cases where CUDA can be faster, such as the 2D FFT, but can the 1D FFT be faster too? I don’t know if this is an implementation problem with the 1D FFT or if the GPU simply does not gain enough speed to overcome the extra memory transfers. I would be very open to any ideas, or to pointers on problems I should check for. I would also like to know if others have experienced the same thing with CUDA.

Here are the results I am getting:

8192x4096 FFT
CUDA - 0.28 s
MATLAB - 0.15 s

1024x1024 FFT
CUDA - 0.010 s
MATLAB - 0.005 s

Thanks for any help in this.

We have similar results.
I suppose the MATLAB routines are built on the Intel MKL libraries. Some routines, like FFT or convolution (1D and 2D), are optimized for multiple cores and, as far as we have been able to test, they are much faster than the CUDA routines with medium-size matrices. :-(
I’m very interested in any clue about this issue, too.

Thanks !

This is expected: CUDA tends to be faster only for large datasets. Using CUDA adds overhead for copying the data to the GPU and back and for setting up the kernel. Small datasets also limit the number of threads that can run (less data parallelism). There’s usually a threshold above which CUDA becomes beneficial, and finding it takes experimentation on the given hardware. For 1D FFTs it’s usually pretty high, I hear.
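
To get a feel for how much of the time could be copy overhead, here is a rough back-of-the-envelope sketch in Python. It assumes single-precision complex data (8 bytes per element) and an effective PCIe bandwidth of 5 GB/s in each direction. Both numbers are my assumptions, not measurements from the original post, so plug in your own.

```python
def transfer_seconds(rows, cols, bytes_per_element=8, bandwidth_bytes_per_s=5e9):
    """Estimate host-to-device plus device-to-host copy time for one matrix.

    bytes_per_element=8 assumes single-precision complex data (assumption).
    bandwidth_bytes_per_s=5e9 is an assumed effective PCIe rate, not a measurement.
    """
    total_bytes = rows * cols * bytes_per_element
    # Factor of 2: the data goes to the GPU and the result comes back.
    return 2 * total_bytes / bandwidth_bytes_per_s


for rows, cols in [(8192, 4096), (1024, 1024)]:
    t = transfer_seconds(rows, cols)
    print(f"{rows}x{cols}: ~{t:.4f} s spent on copies alone")
```

Under these assumptions the copies alone come to roughly 0.11 s for the 8192x4096 case, which is already close to the 0.15 s MATLAB reports for the whole FFT. If your data is double precision, or your real transfer rate is lower, the overhead is larger still.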

Is your timing of the CUDA FFT including the significant time it spends making the plan? It would be a good idea to make the plan one time for your particular FFT size, then save that plan and use it over and over.
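
The plan-once, execute-many pattern can be sketched as follows. This is a pure-Python stand-in, not cuFFT: the `FFTPlan` class is hypothetical, and its "plan" is just a precomputed twiddle-factor table for a radix-2 FFT. In cuFFT the corresponding steps would be a single `cufftPlan1d` call (which also accepts a batch count) followed by repeated `cufftExecC2C` calls. The point of the sketch is only to show plan creation being timed separately from execution.

```python
import cmath
import time


class FFTPlan:
    """Hypothetical stand-in for an FFT plan: twiddle factors for length n."""

    def __init__(self, n):
        if n < 1 or n & (n - 1):
            raise ValueError("n must be a power of two")
        self.n = n
        # Size-dependent setup happens once, here.
        self.twiddles = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]

    def execute(self, x):
        """Radix-2 FFT of one row, reusing the precomputed twiddles."""
        if len(x) != self.n:
            raise ValueError("input length does not match the plan")
        return self._fft(list(x), 1)

    def _fft(self, x, stride):
        m = len(x)
        if m == 1:
            return x
        even = self._fft(x[0::2], stride * 2)
        odd = self._fft(x[1::2], stride * 2)
        out = [0j] * m
        for k in range(m // 2):
            # W_m^k equals W_n^(k * stride), so one table serves every level.
            t = self.twiddles[k * stride] * odd[k]
            out[k] = even[k] + t
            out[k + m // 2] = even[k] - t
        return out


n, rows = 256, 32  # small illustrative sizes, not the sizes from the thread
data = [[complex(r + c, 0) for c in range(n)] for r in range(rows)]

t0 = time.perf_counter()
plan = FFTPlan(n)  # build the plan once, outside the timed loop
t_plan = time.perf_counter() - t0

t0 = time.perf_counter()
results = [plan.execute(row) for row in data]  # reuse the same plan per row
t_exec = time.perf_counter() - t0

print(f"plan: {t_plan:.6f} s, execute ({rows} rows): {t_exec:.6f} s")
```

If the plan is created inside the tic/toc region on every call, its cost gets charged to each FFT, which may explain part of the gap in the numbers above.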