FFT on very large data sets

hamster143 · November 1, 2011, 10:11pm

I want to do FFT on large data sets (basically as much as I can fit in the system memory - say, 2G points.) Is there an easy way to accelerate this with a GPU? The CUFFT library will only go as far as 16M points on my card when working in double precision internally.

tera · November 2, 2011, 1:59am

I haven’t done this myself, but you should be able to use the recursive property of the Cooleyâ€“Tukey algorithm to break the FFT into multiple smaller FFTs that you can do on the device, and then combine them into one large FFT on the host. This would even allow to run a single FFT on multiple GPUs in parallel.

hamster143 · November 2, 2011, 6:59am

If I understand correctly how this works, since I don’t have enough spare system memory, I’d have to collect data for each hardware FFT call by walking through the entire system memory, and then write it back by going through it again.

E.g., if the GPU array size is 16M and the CPU array size is 2G, then N2 is 16M, N1 is 128, so,

collect input array = [x_0, x_128, x_256, x_384, … x_2147483632]

copy to the device

do HW FFT

copy back from the device

write back results into x_0, x_128, … x_2147483632

collect input array = [x_1, x_129, x_257, … 2147483633]

…

repeat 128 times. Then do a second round of FFTs (this time by executing 16M 128-element FFTs.)

In this case, system memory bandwidth will be a huge bottleneck. With all these widely spaced random accesses, it’s basically like having to read and write the entire contents of the memory 128 times.

But it’s definitely worth a try.

P.S. I think I could spare 1G of system memory as a scratch area, in which case I’d only need 16 passes instead of 128. That’s better.

Let’s see how well this will perform in practice.

pasoleatis · November 2, 2011, 8:28am

If I understand correctly how this works, since I don’t have enough spare system memory, I’d have to collect data for each hardware FFT call by walking through the entire system memory, and then write it back by going through it again.

E.g., if the GPU array size is 16M and the CPU array size is 2G, then N2 is 16M, N1 is 128, so,

collect input array = [x_0, x_128, x_256, x_384, … x_2147483632]

copy to the device

do HW FFT

copy back from the device

write back results into x_0, x_128, … x_2147483632

collect input array = [x_1, x_129, x_257, … 2147483633]

…

repeat 128 times. Then do a second round of FFTs (this time by executing 16M 128-element FFTs.)

In this case, system memory bandwidth will be a huge bottleneck. With all these widely spaced random accesses, it’s basically like having to read and write the entire contents of the memory 128 times.

But it’s definitely worth a try.

P.S. I think I could spare 1G of system memory as a scratch area, in which case I’d only need 16 passes instead of 128. That’s better.

Let’s see how well this will perform in practice.

The largest size I did was 9000x9000. I am wondering if you could split the work along directions and then use streams to overlap the calculations and copying from the gpu.

hamster143 · November 2, 2011, 8:53am

No, my task is strictly 1D.

cricri1 · November 2, 2011, 12:47pm

a lot of point ?
why
for calculating pi with more than 5000 000 000 000 decimal ??? External Image

for a size 4 194 304 (2048 line 2048 column) you can do 2048 fft of 2048 pt (read only 1 time each memory) on line
and after 2048 fft of 2048 pt on column (read only 1 time each memory) with the good twister

pasoleatis · November 3, 2011, 8:56am

Hello,

Just a silly suggestion. In the book called “Numerical recipes” (www.nr.com) there is a chapter regarding how to make fast Fourier transforms. From little I know I understand that even 1D fft has lots of parallel calculations. Maybe taking that code and dividing it in several parts as it was suggested before and also overlapping communications and calculations might do the trick.

joemyre · November 15, 2011, 5:14pm

Does this method not require a multiplication of twiddle factors in between the two rounds of FFTs? I tried implementing the method as is but have been unable to verify the results as correct.