I am using FFT in an image processing application. I am planning to replace my existing CPU-based FFT (which is based on the Cooley-Tukey algorithm) with CUFFT.
I have a few questions regarding CUFFT.
( 1 ) Which FFT algorithm does CUFFT use internally? Is it Cooley-Tukey?
I need to know this so that the performance comparison is fair.
( 2 ) How much speedup can I expect for an image of size 5k x 5k? A rough idea is enough.
My CPU application takes around 71 seconds to complete.
On a G92, for a complex-to-complex forward transform followed by the backward transform (that is, you get your original image back) of a 2048x2048 grayscale image, I measured around 100 milliseconds (0.1 seconds), including interleaving/deinterleaving (Re + Im ↔ complex) and host<->GPU data transfers.
For comparison, FFTW takes around 0.3 seconds (3x the time) for the same task, and that’s with a 3.0 GHz quad-core using the Single Precision SSE version of FFTW, multi-threaded (NThreads = 4).
So I’d say CUFFT is pretty fast in this situation.
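One practical detail of the forward + backward round trip: CUFFT (like FFTW) computes unnormalized transforms, so the inverse of the forward result is the input multiplied by the number of elements, and you have to divide by that count yourself. Here is a minimal sketch of this in plain Python, using a textbook recursive radix-2 Cooley-Tukey FFT. It is only an illustration of the scaling convention, not CUFFT's implementation, and the names `fft_radix2` and `roundtrip` are made up for this example.

```python
import cmath

def fft_radix2(x, sign=-1):
    """Unnormalized radix-2 Cooley-Tukey FFT.

    sign=-1 gives the forward transform, sign=+1 the (unscaled) inverse.
    The length must be a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2], sign)   # transform of even-indexed samples
    odd = fft_radix2(x[1::2], sign)    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd half (the butterfly step).
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def roundtrip(x):
    """Forward then inverse transform, rescaled by 1/N to recover x."""
    n = len(x)
    spectrum = fft_radix2(x, -1)
    unscaled = fft_radix2(spectrum, +1)   # equals n * x at this point
    return [v / n for v in unscaled]

signal = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
restored = roundtrip(signal)
```

For a 2D image the same applies with N = rows * cols. On the GPU that division is typically one extra small kernel or can be folded into the deinterleaving step.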
Please note that I can’t try 5Kx5K, since on my 512 MB G92 I can only go as far as about 3Kx3K (I think a 1.5 GB card should handle 5Kx5K).
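To get a feel for the memory side, here is a rough back-of-the-envelope sketch in Python. It assumes single-precision complex data at 8 bytes per element (two 32-bit floats, which is what `cufftComplex` holds) and takes "5K" to mean 5120. The helper name `c2c_bytes` is made up for this example.

```python
def c2c_bytes(rows, cols, bytes_per_element=8):
    """Size in bytes of one single-precision complex image buffer."""
    return rows * cols * bytes_per_element

for side in (2048, 3072, 5120):
    mib = c2c_bytes(side, side) / 2**20
    print(f"{side}x{side}: {mib:.0f} MiB per complex buffer")
# prints:
# 2048x2048: 32 MiB per complex buffer
# 3072x3072: 72 MiB per complex buffer
# 5120x5120: 200 MiB per complex buffer
```

This is the size of one buffer only. The total footprint depends on how many such buffers are alive at once (input, output, separate Re/Im planes) plus CUFFT's own work area and whatever else is using the card, none of which is covered by the numbers above.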
Your mileage may vary if the number of rows and/or columns in your image is not a multiple of the number of streaming multiprocessors your GPU has (because of uncoalesced memory access and/or bank conflicts, if I recall correctly).
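Separately from the multiprocessor-count point, the CUFFT documentation says the library is best optimized for sizes of the form 2^a * 3^b * 5^c * 7^d, with powers of two usually fastest. If your image dimensions are awkward, padding them up to such a size is a common workaround. Below is a small Python sketch for picking a padded size. The function names are made up for this example, and whether padding pays off in your case is something to measure.

```python
def is_7_smooth(n):
    """True if n has no prime factors other than 2, 3, 5 and 7."""
    if n < 1:
        return False
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    return n == 1

def next_7_smooth(n):
    """Smallest size >= n of the form 2^a * 3^b * 5^c * 7^d."""
    n = max(n, 1)
    while not is_7_smooth(n):
        n += 1
    return n

def next_pow2(n):
    """Smallest power of two >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

for size in (5000, 2049):
    print(size, "->", next_7_smooth(size), "or", next_pow2(size))
```

Note that 5000 = 2^3 * 5^4 already has the preferred form, so a 5000x5000 image would need no padding on that account, whereas rounding it up to a power of two (8192) would nearly triple the element count.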