CuFFT - Limits and decomposition of large data set

I’m working on a CUDA project and since other people at my lab have found out about it i’ve been asked to help with some other things too. One of the more interesting is doing 2d giga-element FFT’s.

Now i’ll be the first to say that I know absolutely nothing about how the FFT algorithm works, I do know what a Fourier Transform is and all that (so im not totally in the dark).

What I want to know is can you keep on pushing the CUFFT functions to run with huge data sets? I suspect the idea is yes, but it’ll be crap, or perhaps there is simply a flat out cut-off point.

This then raises the second question, has anyone tried butchering a distributed FFT algorithm, to decompose the problem into smaller “GPU efficient blocks” which could be shuffled on and off the card (or across the PCI bus and also around in the card RAM) where instead of running on many machines (as in a cluster) you just run all of the blocks in series (or on as many cards as you have in your box).

It seems like this would probably work quite nicely, and it would at least be faster than doing a regular FFT, simply because the FPU’s are super speedy on the card and you still have more compute paralellism than on a cpu.

I did a quick google, but only came up with decompositions over clusters of CPU’s.

Thanks for the help

FFT the algorithm naturally scales to larger and larger sizes without any problems because it continually breaks the problem into halves recursively. What you will run into in trying such large transforms on the GPU is a memory limitation (only 1.5 GiB on Tesla, less on other cards).

I see, the CUFFT documentation states that the 2d FFT stops at 16536 elements on a side (off the top of my head) so it seems like it might be efficient to break the problem down to the best size for occupancy / coalescing on the card. Along the lines that large scale cluster based fft algorithms work (i’m still trying to figure this out).

Ill keep reading things