Here is my implementation of batched 2D transforms, just in case anyone else would find it useful. I’ve developed and tested the code on an 8800GTX under CentOS 4.4. The API is consistent with CUFFT.
There is a lot of room for improvement (especially in the transpose kernel), but it works and it’s faster than looping a bunch of small 2D FFTs.
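For anyone wondering what "batched" buys you numerically, here is a small numpy sketch (not the CUDA code itself) showing that one batched call over a stack of inputs matches a loop of individual 2D FFTs. The sizes are the 70 x 256x64 case mentioned later in the thread.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, rows, cols = 70, 256, 64
data = rng.standard_normal((batch, rows, cols)) \
     + 1j * rng.standard_normal((batch, rows, cols))

# Batched: one call transforms the last two axes of every slice at once.
batched = np.fft.fft2(data, axes=(-2, -1))

# Looped: the equivalent of calling a single 2D FFT once per input.
looped = np.stack([np.fft.fft2(data[i]) for i in range(batch)])

# The "L2 of diff" check from the table above, in miniature.
print(np.linalg.norm(batched - looped))  # essentially zero
```

On the GPU the win comes from launching one kernel over all the slices instead of paying launch overhead per transform.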
Thanks for all the help I’ve been given so far on this forum.
FFTW time = total time for single-threaded FFTW on a Q6600
Time/unit = each time divided by the number of transforms
L2 of diff = L2 norm of the difference between the batched and looped results
Ignore the times in the first row; I didn’t bother to pre-initialize CUDA, so they include startup overhead. Another thing to note is that the data is already on the GPU. Host<->GPU transfers would obviously reduce performance.

FYI, my code runs 70 256x64 transforms at a time. The batched version is almost twice as fast at that size.
Thanks for posting this code, it was very helpful. I’ve posted some data from my 8600GTS, which I suspect would be on the low-end of the performance spectrum compared to what most people will be using.
I’m getting into a project which will require multiple 4D FFTs. Two ideas I’m pondering: 1) Using your method, perform another transpose around the z-axis to “flip” the data structure, then perform another batch of 2D FFTs. Or, 2) Perform one 3D FFT, then “flip” the structure and perform a batch of 1D FFTs. Of course, all the data would remain on the device until completed.
I’m making the assumption that the 3D cuFFT performs a batch of 1D FFTs three times - but I don’t know if it has to transpose the data between batches or if it handles the data another way.
I’m a new user and I’m not sure I have a complete grasp on all of the variables yet, but I’d appreciate any ideas you or others might have.
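Regarding the assumption above about how a 3D FFT might be built from batches of 1D FFTs: here is a numpy sketch of that idea (a guess at the structure, not how cuFFT actually implements it). Each pass runs a batch of 1D FFTs along the contiguous axis, then a cyclic transpose “flips” the data so the next axis becomes contiguous; three passes reproduce the direct 3D transform.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 8)) + 1j * rng.standard_normal((8, 8, 8))

y = x
for _ in range(3):
    y = np.fft.fft(y, axis=-1)       # batch of 1D FFTs along the fast axis
    y = np.transpose(y, (2, 0, 1))   # cyclic transpose: bring the next axis last

# Three cyclic transposes return the array to its original orientation,
# and every axis has been transformed exactly once.
print(np.linalg.norm(y - np.fft.fftn(x)))  # close to zero
```

The transpose between passes is exactly what keeps the memory accesses contiguous (coalesced, on the GPU) for each 1D batch.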
I am by no means an FFT guru, but I’ll give you my thoughts.
I would suspect implementing another transpose/1D FFT would be easiest. I understand there are some memory access patterns you can use in a 3D FFT that may be faster than another transpose on a CPU, but the transpose should allow for coalesced memory accesses on the third FFT pass on the GPU.
I’d be interested to see what you come up with (if you can share your results).
I was wondering if this batch implementation could somehow be used to speed up a single 2D Complex2Complex FFT (and iFFT).
I need to perform a 1024x1024 C2C FFT, and found cufft to be slower than FFTW when data transfers are included (and yes, I use pinned memory).
(My platform is a Core2Quad at 3.0 GHz and a G92 8800GTS clocked at 750/1750/2100).
Is it possible to “decompose” a single 2D 1024x1024 C2C FFT in a batch of 1D FFTs?
Or, do you have any other suggestion? I know, ideally one should perform more computation on the transformed data before moving them off the GPU, but I can’t do that.
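For reference, the decomposition being asked about is the standard row-column algorithm. A numpy sketch (not CUFFT code; on the GPU each pass would map to a batched 1D plan with a transpose in between):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1024
x = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# Pass 1: a batch of n 1D FFTs along the rows.
y = np.fft.fft(x, axis=1)

# Transpose so the former columns become contiguous rows,
# run a second batch of n 1D FFTs, then transpose back.
y = np.fft.fft(y.T, axis=1).T

print(np.linalg.norm(y - np.fft.fft2(x)))  # small
```

Whether this beats a library’s native 2D plan depends on how good the transpose is, which is where the original poster noted the most room for improvement.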
Quick question:
In my C program I am looping a 2D FFT 3096 times. What’s the procedure for the equivalent operation in CUDA? If I understand correctly, Jim tried looping and found it slower, hence the need for a batched 2D FFT? Can someone please explain? Also, if it’s a batched 2D FFT, does that mean I need to compute a 2D FFT over an array of 4096 (FFT size) x 3096 (number of loops)?
I have a 64x64 (4096-point) 2D FFT. In C, the 2D FFT is looped 3096 times because the input is different on every iteration. In my CUDA version I am only able to compute the FFT once: I’m not sure whether I should loop it or do a batched FFT, and I also don’t know how to change the input for each 2D FFT call.
If you can set up and fill in all 3096 input arrays contiguously in GPU memory, you can use a batched FFT. I’m not sure what you mean by only being able to compute the FFT once and not knowing how to change the inputs to the 2D FFT call. Can you be more specific or provide some code?
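To illustrate the “contiguous inputs” point, here is a numpy sketch (with a batch of 8 standing in for the 3096 inputs, purely for brevity): each different input is written back-to-back into one buffer, which is the layout a batched plan expects, and a single batched call then replaces the whole loop.

```python
import numpy as np

rng = np.random.default_rng(3)
batch, n = 8, 64  # 8 stands in for the 3096 inputs

# One contiguous buffer holding all inputs back-to-back.
buf = np.empty((batch, n, n), dtype=np.complex64)
for i in range(batch):
    # Each iteration fills in a different input, as in the C loop.
    buf[i] = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# One batched call transforms every slice at once.
out = np.fft.fft2(buf, axes=(-2, -1))

# Same result as calling the 2D FFT once per input in a loop.
assert all(np.allclose(out[i], np.fft.fft2(buf[i])) for i in range(batch))
```

The important part is the loop that fills the buffer: the inputs change per iteration, but only the transform itself is batched.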