What's the CUFFT batch parameter?


I have a 4096-sample array to apply an FFT to. Will batching the array improve speed? Is it like dividing the FFT into small DFTs and computing the whole FFT from them?

I don't quite understand the use of the batch parameter, and didn't find explicit documentation on it… I think it might be one of two things, either:
divide one FFT calculation into parallel DFTs to speed up the process
calculate one FFT x times and average the results

both might be wrong ^^

anyone care to explain? maybe show me an explained example?

With batch=1 the FFTs take much more time than IPP :\ I wanted to speed this up… (right now it's about 3 seconds with IPP versus 20 seconds with CUFFT; 4096-sample C2C, 10000 1D FFTs, without magnitude calculation).
I don't know if batching is the answer here, though…


The batch feature is simply used to compute the FFT of multiple vectors in a single call. This is much more efficient than calling the FFT over and over in a loop, since some of the intermediate twiddle factors can be reused. In order to use the batch feature for your application, all of the 10000 4096-point inputs should be in one long contiguous block of linear memory (40960000 elements total).

The plan would look like:

cufftPlan1d(&myPlan, 4096, CUFFT_C2C, 10000);
The execution would look like:

cufftExecC2C(myPlan, idata, odata, CUFFT_FORWARD);

I'm pretty sure that a G80 should beat a CPU for this many FFTs, even including the host/device transfers. Good luck.
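Putting the plan and execution calls above into context, a minimal host-code sketch for the batched transform might look like the following (the NX/BATCH macros are my own naming, and most error checking and the host/device copies are elided for brevity):

```cuda
#include <cufft.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define NX    4096    /* points per FFT */
#define BATCH 10000   /* number of independent FFTs in the batch */

int main(void)
{
    cufftComplex *idata, *odata;
    cudaMalloc((void **)&idata, sizeof(cufftComplex) * NX * BATCH);
    cudaMalloc((void **)&odata, sizeof(cufftComplex) * NX * BATCH);
    /* ... cudaMemcpy the 10000 packed input signals into idata ... */

    cufftHandle myPlan;
    if (cufftPlan1d(&myPlan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftPlan1d failed\n");
        return 1;
    }

    /* One call transforms all BATCH signals at once. */
    cufftExecC2C(myPlan, idata, odata, CUFFT_FORWARD);

    /* ... cudaMemcpy odata back to the host ... */

    cufftDestroy(myPlan);
    cudaFree(idata);
    cudaFree(odata);
    return 0;
}
```

The key point is that the batch count goes into the plan, so the library can set up all 10000 transforms once and reuse the twiddle factors across them.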

Thank you, your post was most informative.
Something like this should be in the programming guide; it's not clear there how the batch parameter works.

My problem was that I was calling my function from the host 10000 times.
If I just put the cufftExec call in a loop, the results change a lot.

100000 iterations take 20 seconds on the GPU and 34 on the CPU, although the CPU calculates the magnitude of the values in each iteration and the GPU only once so far.
Now I have to work on threading the magnitude function :)
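For the magnitude step, a simple element-wise kernel is usually enough, since each output sample is independent. A rough sketch (the kernel name and launch configuration are my own choices, not anything CUFFT provides):

```cuda
#include <cufft.h>
#include <math.h>

/* One thread per complex sample: |z| = sqrt(re^2 + im^2).
 * cufftComplex is a float2, so .x is the real part and .y the imaginary. */
__global__ void magnitude(const cufftComplex *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sqrtf(in[i].x * in[i].x + in[i].y * in[i].y);
}

/* Launch over all NX * BATCH output samples, e.g.:
 *   int n = 4096 * 10000;
 *   magnitude<<<(n + 255) / 256, 256>>>(odata, d_mag, n);
 */
```

Doing this on the device right after cufftExecC2C also avoids copying the complex data back to the host just to compute magnitudes there.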

I'm trying to compute the maximum number of 1k FFTs I can batch on a Tesla card, but the largest batch size I can use without getting CUFFT_INVALID_VALUE is much lower than 100000…
When I compare the time between the CPU and my GPU calculation, I don't see much of a difference…
Obviously, there is something wrong in my code…

Any ideas?