I’m doing some benchmarking of CUFFT and would like to know whether my results are reasonable. I’d be happy if you would post some of your results, and please also specify which card you have.
I have an FX 4800 card.
The times and calculations below are for an FFT followed by an inverse FFT.
For a 4096K-element vector, I get a kernel time of 14 ms (not counting memory-copy time).
Using the formula 5 * N * log2(N) * 2 (the factor of 2 comes from doing both the FFT and the inverse FFT), that gives 922,746,880 operations in 14 ms:
(922,746,880 operations) / (0.014 s) ≈ 62 GFLOPS.
I find this reasonable when comparing to other results.
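In case anyone wants to compare like for like, here is a minimal sketch of this kind of measurement. It assumes single-precision complex-to-complex transforms done in place, with a cufftPlan1d plan and CUDA event timing so that only the kernel time is counted; my actual harness may differ in details.

```c
#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    const int N = 4096 * 1024;                  /* 4096K-point transform */

    cufftComplex *d_data;
    cudaMalloc((void **)&d_data, sizeof(cufftComplex) * (size_t)N);
    cudaMemset(d_data, 0, sizeof(cufftComplex) * (size_t)N);  /* contents don't affect timing */

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);        /* plan created outside the timed region */

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   /* forward FFT */
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);   /* inverse FFT */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     /* kernel time only, no memcpy */

    /* 5 * N * log2(N) flops per transform, times 2 for forward + inverse */
    double flops = 2.0 * 5.0 * (double)N * log2((double)N);
    printf("kernel time = %.3f ms, ~%.1f GFLOPS\n", ms, flops / (ms * 1e-3) / 1e9);

    cufftDestroy(plan);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```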
However, for vectors of 4096 elements with a batch of 8192, we should have 8192 * (5 * N * log2(N) * 2) ≈ 4.03 billion operations. The operation took 34 ms, which works out to roughly 118 GFLOPS.
This seems a bit high to me. I know the theoretical GFLOPS of the card is higher, but this computation has complexity N log N, so the number still feels high. Also, I do not understand why the number of batches is such a large factor. Do you have any theory as to why increasing the number of batches increases the GFLOPS?
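For completeness, here is the arithmetic behind the batched figure, as a small sanity check using the same 5 * N * log2(N) flop convention as above (nothing here is measured, it is just the flop count divided by the 34 ms):

```c
/* Arithmetic check for the batched measurement: 8192 transforms of 4096 points
   each, forward + inverse, in 34 ms. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double n = 4096.0, batch = 8192.0, ms = 34.0;
    double flops = batch * 2.0 * 5.0 * n * log2(n);    /* ~= 4.03e9 flops */
    printf("%.2e flops in %.0f ms -> %.0f GFLOPS\n",
           flops, ms, flops / (ms * 1e-3) / 1e9);      /* ~118 GFLOPS */
    return 0;
}
```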
Thanks, and looking forward to hearing your comments and thoughts!
From what I know about CUFFT (and FFTW, for that matter), they create a “plan” to implement an FFT. That is, they compose the overall FFT out of several smaller sub-FFTs, trying to minimize the overall execution time.
When you do multiple batches, the plan is reused, so the overhead of creating the plan is less dominant in the overall execution time. The GFLOPS number rises because of that reduced overhead.
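Just to illustrate what I mean (a hypothetical sketch, not your benchmark code): if the plan is created once and then reused, its setup cost is spread over every transform that follows.

```c
#include <cuda_runtime.h>
#include <cufft.h>

/* Hypothetical illustration of plan reuse: the expensive plan setup happens
   once, and every subsequent transform reuses the precomputed plan. */
void run_many_transforms(cufftComplex *d_data, int num_runs)
{
    cufftHandle plan;
    cufftPlan1d(&plan, 4096, CUFFT_C2C, 8192);   /* expensive setup, done once */

    for (int i = 0; i < num_runs; ++i) {
        /* cheap per-run work: the same plan is reused each iteration */
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);
    }

    cufftDestroy(plan);
}
```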
Take all this with a grain of salt, I have no clue what I am talking about. ;)