How many GFLOPS do you get with CUFFT?


I’m doing some benchmarking of CUFFT and would like to know whether my results are reasonable. I’d be happy if you could post some of your results and also specify which card you have.

I have a FX 4800 card.

The times and calculations below are for an FFT followed by an inverse FFT.

For a 4096K-long vector (4,194,304 elements), I have a KERNEL time (i.e., not counting memory-copy times) of 14 ms.

Using the 5·N·log2(N) × 2 formula (the factor of 2 comes from doing both the FFT and the inverse FFT), that gives 922 746 880 operations in 14 ms:

(922 746 880) / (0.014 s) ≈ 66 GFLOPS.
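The numbers above can be checked with a couple of lines (this just evaluates the stated formula at the quoted 14 ms kernel time):

```python
import math

N = 4096 * 1024                     # 4096K-long vector = 4,194,304 elements
flops = 5 * N * math.log2(N) * 2    # 5*N*log2(N), times 2 for FFT + inverse FFT
t = 0.014                           # measured kernel time in seconds

print(flops)                        # 922746880.0 operations
print(flops / t / 1e9)              # ~65.9 GFLOPS
```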

I find this reasonable compared to other results I’ve seen.

However, when computing vectors of 4096 elements in a batch of 8192, we should have 8192 × (5·N·log2(N) × 2) operations. The operation took 34 ms.

This gives:

(8192 × 5 × 4096 × log2(4096) × 2) / (0.034 s) ≈ 118 GFLOPS.
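Same check for the batched case, again just evaluating the stated formula at the quoted 34 ms:

```python
import math

N, batch = 4096, 8192               # 4096-point FFTs, batch of 8192
t = 0.034                           # measured kernel time in seconds
flops = batch * 5 * N * math.log2(N) * 2   # times 2 for FFT + inverse FFT

print(flops / t / 1e9)              # ~118.4 GFLOPS
```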

This seems a bit high to me. I know the theoretical peak GFLOPS is higher, but this computation has complexity N·log(N), and hence I feel it is a bit high. Also, I do not understand why the number of batches is such a large factor. Do you have any theory as to why increasing the number of batches increases the GFLOPS?

Thanks, and looking forward to hearing your comments and thoughts!


From what I know about CUFFT (and FFTW, for that matter), it creates a “plan” to implement an FFT. That is, it composes the overall FFT out of several smaller sub-FFTs, trying to minimize the overall execution time.

When you do multiple batches, the plan is reused, and the overhead of creating the plan becomes less dominant in the overall execution time. So the GFLOPS number rises because of the reduced overhead.

Take all this with a grain of salt, I have no clue what I am talking about. ;)



Hehe, ok, thanks. The only problem is that my timings cover only the execution kernels, not the plan creation.


Let me propose another theory:

Internally, CUFFT performs batched FFTs with the same number of kernel calls as unbatched FFTs.

In the batched case a kernel simply processes more data at once, so kernel-launch overhead is a smaller fraction of your overall execution time.

Sounds reasonable?
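A toy model of that argument: if each transform costs a fixed number of kernel launches regardless of batch size, the effective GFLOPS rises with the batch. The launch overhead, kernel count, and raw throughput below are assumed figures for illustration only, not measurements:

```python
import math

# Assumed constants, purely for illustration:
LAUNCH_OVERHEAD = 5e-6    # seconds of fixed cost per kernel launch (assumption)
KERNELS_PER_PASS = 3      # hypothetical launches per FFT pass (assumption)
RAW_GFLOPS = 150.0        # hypothetical raw compute rate (assumption)

def effective_gflops(n, batch):
    """Effective throughput when a fixed launch cost is amortized over a batch."""
    flops = batch * 5 * n * math.log2(n) * 2         # FFT + inverse FFT
    compute_time = flops / (RAW_GFLOPS * 1e9)        # idealized execution time
    overhead = KERNELS_PER_PASS * 2 * LAUNCH_OVERHEAD  # same launches, any batch size
    return flops / (compute_time + overhead) / 1e9

print(effective_gflops(4096, 1))      # launch overhead dominates one small FFT
print(effective_gflops(4096, 8192))   # amortized; approaches the raw rate
```

With these made-up numbers a single 4096-point transform is throttled to a small fraction of the raw rate, while the 8192-wide batch gets close to it, which is the shape of the effect being discussed.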

There is also more memory locality when doing many short FFTs versus one really big one: a 4096-point transform has a far smaller working set than a 4M-point one. FFTs are memory-bound, hence the batched FFTs are faster.
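A quick size comparison supports the locality argument (assuming single-precision complex data, 8 bytes per element):

```python
BYTES_PER_ELEMENT = 8                        # single-precision complex (assumed)

small = 4096 * BYTES_PER_ELEMENT             # one 4096-point FFT
large = 4096 * 1024 * BYTES_PER_ELEMENT     # the 4M-point FFT

print(small / 1024, "KiB")                   # 32.0 KiB: tiny working set per transform
print(large / 1024**2, "MiB")                # 32.0 MiB: must stream through device memory
```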