K1 and cuFFT


I couldn’t find any information or answer on this topic, so I’m posting my question here.

Has anyone run performance tests with cuFFT on the K1 devboard? What can we expect for a 200x200 patch of float pixels, for example?

Such an FFT costs around 5 N log2(N) = 5 x 40000 x 15.29 ≈ 3 Mflop, assuming my estimate is correct. So one thousand such patches should correspond to about 3 Gflop (excluding memory buffer allocations, etc.), i.e. about 3 Gflop/s if they are processed in one second, which is far below the maximum advertised Gflop/s value. Am I right?
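The estimate above can be checked with a short Python sketch. It assumes the 5 N log2(N) operation count used in the post; the function name `fft_flop_estimate` and its `k` parameter are made up for illustration.

```python
import math

def fft_flop_estimate(n_points, k=5.0):
    """Estimated flop count of one FFT over n_points samples: k * N * log2(N).

    k = 5 is the convention used in the post above, not a measured value.
    """
    return k * n_points * math.log2(n_points)

patch_points = 200 * 200                      # 200x200 float patch -> N = 40000
per_patch = fft_flop_estimate(patch_points)   # roughly 3.06e6 flop
per_thousand = 1000 * per_patch               # roughly 3.06e9 flop

print(f"one patch:    {per_patch / 1e6:.2f} Mflop")
print(f"1000 patches: {per_thousand / 1e9:.2f} Gflop")
```

Note that these are amounts of work (flop), not rates. They only become flop/s once divided by the time the transforms actually take.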

Thank you very much for your help,

I’m not sure about the detailed math here, but I followed the instructions here: http://www.pugetsystems.com/blog/2014/05/23/NVIDIA-Jetson-TK1-CUDA-performance-569/ to set up the CUDA environment (you have to be a registered developer). I didn’t copy the samples though, and I set the paths by hand as described here: http://askubuntu.com/questions/210884/setting-ld-library-path-for-cuda

The “oceanFFT” example in the Samples/Simulation folder might be exactly what you are looking for (it uses libcufft). After you “make” it, you can start it from the terminal. Then have a look at the code, and go from there ;)

Thanks HellMood. I’ll start from there.


We conducted several tests of cuFFT performance on the K1, computing 2D FFTs. The FFTs are computed using cufftPlanMany() to minimize processing time. All memory allocations and transfers are excluded from the measurement. The test setup is as follows:

  • 256x256 data patches
  • 48 patches
  • 15 iterations per patch
    => 720 x 2D-FFT computed.

Running that test on the K1 takes between 120 and 150 ms. In 1 s of processing, we could therefore compute about 8 times as many 2D FFTs, i.e. 5760 2D FFTs, or 5760 Mflop/s if each 256x256 2D FFT is counted as roughly 1 Mflop.
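The throughput figure can be reproduced as follows. This is only a sketch of the arithmetic: it takes the 720 FFTs and the 120 to 150 ms range from the post, and assumes N log2(N) with no constant factor as the per-FFT cost, which is what the 5760 Mflop/s figure implies. The helper name `ffts_per_second` is illustrative.

```python
import math

def ffts_per_second(n_ffts, elapsed_s):
    """Throughput in FFTs per second for n_ffts completed in elapsed_s seconds."""
    return n_ffts / elapsed_s

n_ffts = 48 * 15                                 # 48 patches x 15 iterations = 720 FFTs
n_points = 256 * 256                             # samples per 2D FFT
flop_per_fft = n_points * math.log2(n_points)    # 65536 * 16 = 1048576, no constant factor

for elapsed_ms in (120, 125, 150):               # 125 ms matches the "8 times" in the post
    rate = ffts_per_second(n_ffts, elapsed_ms / 1000.0)
    gflops = rate * flop_per_fft / 1e9
    print(f"{elapsed_ms} ms -> {rate:.0f} FFT/s, {gflops:.2f} Gflop/s")
```

At 125 ms this gives 5760 FFT/s. Using the exact count of 1048576 flop per FFT rather than the rounded 1 Mflop lands at roughly 6 Gflop/s, slightly above the 5.76 Gflop/s quoted.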

If we look at the chart in the CUDA 6 Performance Report (http://developer.download.nvidia.com/compute/cuda/6_0/rel/docs/CUDA_6_Performance_Report.pdf), a 1D FFT of the same size reaches around 450 Gflop/s. Given that the K40 is about 10x faster than the K1, we should be able to process 7.8x more 2D FFTs in the same time, or run 7.8x faster (450 Gflop/s / 10 / 5.76 Gflop/s = 7.8).

So, what are we missing there?

Thank you for your comments.

You may want to check your GPU MHz:


Hi patricka,

I think you’re forgetting about the constant factor k in terms of complexity.
2D FFT complexity is O(N log2(N)) = O(256 x 256 x log2(256 x 256)) ~= 1M floating point multiplications.
Using this approximation, the throughput would indeed be 8 x 720 x 1M = 5.76 Gflop/s.

The true complexity, however, is k1 x N x log2(N) + k2… + k3…, where the k parameters are constants. The influence of k2, k3, etc. decreases as N increases, because the complexity associated with k1 is based on multiplications, whereas the other parameters are based on computationally less expensive additions and constant costs such as setup time. Therefore the complexity can be approximated as k x N x log2(N).

The value of k1 depends on the actual implementation of the FFT.
Check out http://en.wikipedia.org/wiki/Fast_Fourier_transform for more info.
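To see how much the constant factor matters for the numbers in this thread, here is a small sketch. It assumes k = 5, the value used in the first post (5 N log2(N)). Whether the 450 Gflop/s chart figure was computed with that same convention is an assumption, as is the 10x K40-to-K1 ratio quoted earlier. The function name `gflops` is illustrative.

```python
import math

def gflops(ffts_per_s, n_points, k):
    """Achieved Gflop/s if each FFT is counted as k * N * log2(N) flop."""
    return ffts_per_s * k * n_points * math.log2(n_points) / 1e9

measured_ffts_per_s = 5760        # from the test above (720 FFTs in about 125 ms)
n_points = 256 * 256              # samples per 2D FFT
expected_gflops = 450.0 / 10.0    # chart figure scaled by the assumed 10x ratio

for k in (1.0, 5.0):
    achieved = gflops(measured_ffts_per_s, n_points, k)
    gap = expected_gflops / achieved
    print(f"k = {k:.0f}: {achieved:.1f} Gflop/s achieved, gap = {gap:.1f}x")
```

Under these assumptions, counting each FFT with k = 5 instead of k = 1 shrinks the apparent gap from roughly 7x to roughly 1.5x, so most of the "missing" performance may simply be a difference in how flops are counted.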