I’m doing some benchmarking of CUFFT and would like to know if my results are reasonable or not and would be happy if you would post some of your results and also specify what card you have.
I have a FX 4800 card.
The times and calculations below are for FFT followed by an invFFT
For a 4096K long vector, I have a KERNEL time (not counting memory copy times that is) of 14ms.
Using the 5Nlog(N;2) * 2 (the 2 comes from doing both fft and inv fft) formula, that gives 922 746 880 operations in 14ms gives:
(922 746 880) / (0.014s) = 62 GFLOPS.
I find this reasonable when comparing to other reuslts.
However, when computing for vectors of 4096 elements and using a batch of 8192, we should have 8192 (5Nlog(N) * 2) operations. The operation took 34ms
(8192 * 5 * 4096 * log(4096;2) * 2) / (0.034s) = 110 GFLOPS.
This to me seems a bit high. I know the theoretical GFLOPS is higher, but this computation has complexity NlogN, and hence I feel it is a bit high. Also, I do not understand why the number of batches is such a large factor. Do you have any theory as to why the increased number of batches increases the GFLOPS?
Thanks, and looking forward to hearing your comments and thoughts!