cufft algorithm


I am interested in cufft implementation rather than its usage/manual. Which algorithm is CUDA FFT based on? Is there any documentation (or code) about that?

Thanks in advance,

it was ported from fftw library.

It is not true.
It has an interface similar to FFTW ( create a plan, execute the plan, destroy the plan) but the algorithms are different.


Based on my experience, the 2D and real valued versions of FFT are quite slow, most likely because they’re wrappers around the complex versions. I was going to try to build a specialized 2D, real valued FFT, but do you know if NVIDIA is in the process of releasing such a version any time soon?

This is a really good paper for implementing a faster version:…t.aspx?id=70576

Ah, great. But after skimming it, it looks like their 2D, real performance is only 2x better than CUFFT, which isn’t what I was looking for. They didn’t use any specialized 2D approaches - they still transform the rows, then the cols. I don’t think their argument of using a 2D texture cache to make the column transforms have better locality is good - for small blocks it works, but I don’t know of how a cache can be organized so that a large row and column both have spatial locality.

Also, they used the text book approach of doing 2 for the price of 1, for real valued FFTs by converting 2 real FFTs into a complex FFT.

There was another report from Microsoft (I know Burton Smith from a talk he gave @Georgia Tech) that got better results ~300 Gflops on GTX 280, but it was only for 1D, complex FFTs.

What I’m really looking for is a 2D transform that does a 2D decomposition for better memory locality and avoids complex math at least at the finest grid levels.
The best references I can find on specialized 2D real FFTs are Pierre Duhamel’s reports from the 1980s.