Using FFT for integrations: Single Threaded FFT?

Hi All!

We got a GTX260 in my workstation to hopefully do some parallel numerical integrations and after researching, I have a slew of questions.

The three kinds of integrators I hope to test run as such.

Pseudo-Spectral: Each vector is 2^8 long. Maybe 2^9 depending on solution stability.

Foreach timestep

    [*] FFT vector

    [*] math and iFFT

    [*] FFT again

    [*] math and iFFT

    [*] FFT again

Crank-Nicolson:

Foreach timestep

    [*] Shift and add some vectors

    [*] Inverse a matrix (in a clever way)

Vanilla RK4:

Foreach timestep

    [*] Shift and add some vectors

    [*] multiply some stuff

    [*] Shift and add some vectors

So, I basically need to run these integrators 2000 (or many more) times across different parameters and give back results I can then analyze in Matlab. CUFFT is really good at doing LONG vectors many many times, but I have a relatively short vector (256) and I need to fiddle with stuff inbetween each FFT.

I had planned on running each integration as its own thread on the GPU, but the CUFFT wants to do the FFT across the whole chip. I don’t think doing CUFFT for a 256 long vector will give me much improvement over just running it on the CPU. However if I can run all these integrators as single threads, I imagine I can get a (constant*multiple of GPU cores) speedup.

Am I going to have to code a simple single-threaded FFT implementation? Is it possible to use an FFT implementation from the GNU Scientific Library (GSL) as a linked in library that gets run on the GPU?

thanks a bunch! I’m super CUDA newbie, so any help is awesome.

Max

This may be a hard road to hoe. GPUs are SIMT, which means that you essentially need to issue the same instruction to each thread in a half-warp in order to prevent pipeline slowness. The memory coalescing requirements further complicate things.

If you REALLY want a to make a single-threaded fft, then KISSFFT might be a decent starting point. It would need to be refactored a bit, as CUDA doesn’t support recursion.

I suggest a better use of your time is figuring out how to make use of CUFFT to do a whole bunch of small FFTs at once using the batch mode. That is what CUFFT excels at.