I couldn’t find any information or answer on this topic, so I’m posting my question…

Has anyone run performance tests using cuFFT on the K1 devboard? What can we expect for a 200x200 patch of float pixels, for example?

Such an FFT costs around 5 N log2( N ) = 5 x 40000 x 15.29 = 3 MFlop, if my estimate is correct. So one thousand such patches should correspond to about 3 GFlop (excluding memory buffer allocations, etc…), which is far below the maximum advertised GFlops value… Am I right?
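To make the estimate above easy to re-check, here is a minimal sketch of the same arithmetic. The function name `fft_flop_estimate` is made up for illustration, and the factor 5 is only the rule of thumb used in the post, not a figure taken from cuFFT itself.

```python
import math

def fft_flop_estimate(n_points: int, k: float = 5.0) -> float:
    """Rule-of-thumb operation count k * N * log2(N) for an N-point FFT.

    k = 5 is the constant assumed in the post above, not a measured value.
    """
    return k * n_points * math.log2(n_points)

# One 200x200 patch, i.e. N = 40000 points.
patch_flop = fft_flop_estimate(200 * 200)
print(f"one patch:    {patch_flop / 1e6:.2f} MFlop")
print(f"1000 patches: {1000 * patch_flop / 1e9:.2f} GFlop")
```

Note that this counts operations, not operations per second. It only turns into a GFlop/s requirement once you decide how many patches have to be processed per second.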

The “oceanFFT” example in the Samples/Simulation folder might be exactly what you are looking for (it uses libcufft). After you “make” it, you can start it from a terminal. Then have a look at the code and go from there ;)

We conducted several tests of cuFFT performance on the K1 with 2D FFTs. The FFTs are computed using cufftPlanMany() to minimize processing time. All memory allocations and transfers are excluded from the measurement. The test setup is as follows:

256x256 data patches

48 patches

15 iterations per patch
=> 720 x 2D-FFT computed.

Running that test on the K1 takes between 120 and 150 ms. In 1 s of processing, we could therefore compute about 8 times as many 2D FFTs, i.e. 5760 2D FFTs, which corresponds to 5760 Mflop/s at roughly 1 Mflop per FFT.
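As a quick check of the scaling above, the following sketch converts the measured batch time into 2D FFTs per second. The helper name `ffts_per_second` is hypothetical, and 125 ms is simply the value inside the reported 120 to 150 ms range that matches the factor of 8.

```python
def ffts_per_second(n_ffts: int, elapsed_ms: float) -> float:
    """Throughput in transforms per second for a batch timed in milliseconds."""
    return n_ffts * 1000.0 / elapsed_ms

n_ffts = 48 * 15  # 48 patches, 15 iterations each -> 720 transforms

# Both ends of the reported range, plus the 125 ms implied by "8 times".
for elapsed_ms in (120.0, 125.0, 150.0):
    rate = ffts_per_second(n_ffts, elapsed_ms)
    print(f"{elapsed_ms:5.0f} ms -> {rate:.0f} 2D FFTs per second")
```

So the measured range corresponds to somewhere between 4800 and 6000 transforms per second, with 5760 being the 125 ms case.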

I think you’re forgetting about the constant factor k in terms of complexity.
2D FFT complexity is O(N log2(N)) = O(256 x 256 x log2(256 x 256)) ~= 1M floating point multiplications.
Using this approximation, flops would indeed be 8 x 720 x 1M = 5.76 Gflops

The true complexity, however, is k1 N log2(N) + k2… + k3…, where the k parameters are constants. The influence of k2, k3, etc. decreases as N increases, because the k1 term is based on multiplications, whereas the other terms are based on computationally cheaper additions and constant costs such as setup time. The complexity can therefore be approximated as k N log2(N).
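To see how much the constant k matters for the reported figure, here is a small sketch that recomputes the rate for two values of k. The name `achieved_gflops` is made up for illustration. k = 1 reproduces the estimate above, and k = 5 is the rule of thumb from the original question. Neither is a measured property of cuFFT.

```python
import math

def achieved_gflops(ffts_per_s: float, n_points: int, k: float) -> float:
    """Rate in Gflop/s, assuming each N-point FFT costs k * N * log2(N) operations."""
    flop_per_fft = k * n_points * math.log2(n_points)
    return flop_per_fft * ffts_per_s / 1e9

n_points = 256 * 256   # one 256x256 patch
rate = 5760.0          # 2D FFTs per second, from the measurement in this thread

for k in (1.0, 5.0):   # both values of k are assumptions
    gflops = achieved_gflops(rate, n_points, k)
    print(f"k = {k:.0f}: {gflops:.2f} Gflop/s")
```

With k = 1 this gives about 6.04 Gflop/s (slightly above 5.76 because 256 x 256 x 16 is 1,048,576 rather than exactly 1M), and with k = 5 about 30.2 Gflop/s. So the same measurement can be read very differently depending on the constant you assume.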