# CuFFT - Limits and decomposition of large data set

Hi,
I’m working on a CUDA project and since other people at my lab have found out about it i’ve been asked to help with some other things too. One of the more interesting is doing 2d giga-element FFT’s.

Now I'll be the first to say that I know absolutely nothing about how the FFT algorithm works. I do know what a Fourier transform is and all that (so I'm not totally in the dark).

What I want to know is: can you keep pushing the CUFFT functions to run with huge data sets? I suspect the answer is yes, but performance will be crap, or perhaps there is simply a flat-out cut-off point.

This raises a second question: has anyone tried butchering a distributed FFT algorithm to decompose the problem into smaller "GPU-efficient blocks" that could be shuffled on and off the card (across the PCI bus, and around in the card's RAM)? Instead of running on many machines (as in a cluster), you would just run all of the blocks in series (or on as many cards as you have in your box).
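For what it's worth, the standard decomposition is exactly this kind of blocking: a 2D FFT factors into independent 1D FFTs along every row, followed by independent 1D FFTs along every column of the intermediate result. Each batch of rows (or columns) is then a self-contained block that could be shuffled on and off the card. A minimal NumPy sketch of that identity (sizes are just placeholders):

```python
import numpy as np

# Small placeholder size; the same row-column identity holds at any scale.
rows, cols = 64, 128
x = np.random.rand(rows, cols) + 1j * np.random.rand(rows, cols)

# Step 1: independent 1D FFTs along every row (each row is a "block").
step1 = np.fft.fft(x, axis=1)
# Step 2: independent 1D FFTs along every column of the intermediate.
step2 = np.fft.fft(step1, axis=0)

# The row-column decomposition reproduces the full 2D FFT.
assert np.allclose(step2, np.fft.fft2(x))
```

Because the two passes touch disjoint rows (then disjoint columns), the blocks within a pass can run in series, on multiple cards, or on a cluster without communicating with each other.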

It seems like this would probably work quite nicely, and it would at least be faster than doing a regular FFT, simply because the FPUs are super speedy on the card and you still have more compute parallelism than on a CPU.

I did a quick Google search, but only came up with decompositions over clusters of CPUs.

Thanks for the help

The FFT algorithm naturally scales to larger and larger sizes without any problems, because it recursively breaks the problem into halves. What you will run into when trying such large transforms on the GPU is the memory limit (only 1.5 GiB on Tesla, less on other cards).
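To put numbers on that limit: a giga-element 2D transform in single-precision complex needs 8 GiB just to hold the data in place, far beyond 1.5 GiB (and the library will typically want workspace on top of that). A quick back-of-the-envelope check:

```python
# Back-of-the-envelope memory footprint for a giga-element 2D FFT.
side = 32768                      # 32768 x 32768 = 2^30 elements
elements = side * side
bytes_per_complex_float = 8       # single-precision complex: 2 x 4 bytes
footprint_gib = elements * bytes_per_complex_float / 2**30
print(footprint_gib)              # 8.0 GiB for the input array alone
```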

I see. The CUFFT documentation states that the 2D FFT stops at 16384 elements on a side (off the top of my head), so it seems like it might be efficient to break the problem down to the best size for occupancy/coalescing on the card, along the lines of how large-scale cluster-based FFT algorithms work (I'm still trying to figure this out).
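One plausible out-of-core scheme along those lines (a sketch of the idea, not tested against CUFFT): stream fixed-size batches of rows to the card, run batched 1D FFTs on each, transpose, then stream batches of the former columns through a second identical pass. Simulated on the CPU with NumPy, where each chunk stands in for one transfer-plus-transform on the card:

```python
import numpy as np

def blocked_fft2(x, block_rows):
    """2D FFT computed in row blocks, simulating out-of-core passes.

    Each block of `block_rows` rows stands in for one chunk that would
    be copied to the card, transformed with a batched 1D FFT, and
    copied back.
    """
    out = np.empty_like(x, dtype=complex)
    # Pass 1: batched 1D FFTs over row blocks.
    for r in range(0, x.shape[0], block_rows):
        out[r:r + block_rows] = np.fft.fft(x[r:r + block_rows], axis=1)
    # Transpose so the second pass again reads contiguous row blocks.
    out = out.T.copy()
    # Pass 2: batched 1D FFTs over the former columns.
    for r in range(0, out.shape[0], block_rows):
        out[r:r + block_rows] = np.fft.fft(out[r:r + block_rows], axis=1)
    return out.T

x = np.random.rand(256, 256)
assert np.allclose(blocked_fft2(x, block_rows=32), np.fft.fft2(x))
```

The transpose between passes is the expensive part in practice (on a GPU it is what makes the second pass's memory accesses coalesced), which is why the cluster FFT literature spends most of its effort on it.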

I'll keep reading.