Is the FFT algorithm really memory bound?

I have had to ‘roll my own’ FFT implementation in CUDA in the past, but switched to the cuFFT library as the input sizes increased.

Profiling a multi-GPU implementation of a large batched convolution, I noticed that the Pascal GTX 1080 was about 23% faster than the Maxwell GTX Titan X for identical R2C and C2R calls of the same size and configuration.

What I have heard from ‘the internet’ is that FFT implementations are ‘memory bound’, which surprised me, since I have written my own and there is quite a bit of math involved.

The Maxwell GTX Titan X has higher peak global memory bandwidth (336.5 GB/s vs. 320 GB/s) and generally outperforms the GTX 1080 on ‘memory bound’ algorithms.

Not complaining, just wondering why the FFT algorithm is classified as ‘memory bound’.

Am I missing something here?

For example:

3.38315s  3.3356ms           (7812 1 1)       (24 21 1)        32        0B  7.6875KB         -           -  GeForce GTX TIT         1        14  void RADIX_M::radixM_kernel<unsigned int, unsigned int=41, RADIX_M::spec, float>

3.38321s  2.5664ms           (7812 1 1)       (24 21 1)        32        0B  7.6875KB         -           -  GeForce GTX 108         2        24  void RADIX_M::radixM_kernel<unsigned int, unsigned int=41, RADIX_M::spec, float>

The first is the Maxwell Titan X, the second the Pascal GTX 1080.
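Doing the arithmetic on the two kernel durations above, and comparing against the published peak bandwidth specs (quoted from memory, so treat as approximate):

```python
# Kernel durations taken from the profiler output above (milliseconds).
titan_x_ms = 3.3356   # Maxwell GTX Titan X
gtx1080_ms = 2.5664   # Pascal GTX 1080

speedup = titan_x_ms / gtx1080_ms
print(f"GTX 1080 speedup on this kernel: {speedup:.2f}x")   # ~1.30x

# Peak global memory bandwidth from the spec sheets (GB/s).
titan_x_bw = 336.5
gtx1080_bw = 320.0
print(f"Bandwidth ratio (1080 / Titan X): {gtx1080_bw / titan_x_bw:.2f}")   # ~0.95
```

So if this kernel were limited purely by peak DRAM bandwidth, the 1080 should come out slightly slower, not ~30% faster, which is what prompted the question.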

I am not sure how it plays out for small batched FFTs, but to my knowledge large single-precision FFTs performed via CUFFT have been memory bound since at least the Kepler generation, if not earlier. It is simply a question of computational density (which is quite low for FFTs), combined with the fact that FLOPS are growing faster than memory bandwidth. I have participated in FFT optimization only occasionally but it is clear that a lot of work goes into minimizing the number of floating-point operations required (on all platforms, not just GPUs).
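To make the low computational density concrete, here is a back-of-the-envelope roofline sketch. It assumes the textbook ~5·n·log2(n) flop count for a complex FFT and a best-case implementation that reads and writes each complex element exactly once; real cuFFT kernels may move the data more than once, so this is an upper bound on intensity:

```python
import math

def fft_arithmetic_intensity(n):
    """Approximate flops-per-byte for a single-precision C2C FFT of size n,
    assuming ~5*n*log2(n) flops and one read plus one write of the
    8-byte complex64 data (best case; multi-pass implementations
    move the data more than once)."""
    flops = 5 * n * math.log2(n)
    bytes_moved = 2 * n * 8  # one read + one write of complex64 data
    return flops / bytes_moved

for n in (2**10, 2**20, 2**26):
    print(f"n = 2^{int(math.log2(n)):2d}: ~{fft_arithmetic_intensity(n):.1f} flops/byte")
```

Even at n = 2^26 that is only ~8 flops/byte; a GTX 1080 offers roughly 8.9 TFLOPS single precision against 320 GB/s, i.e. a roofline ridge point near 28 flops/byte, so the transform sits well inside the bandwidth-limited region.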

One hypothesis to consider: The memory bandwidth ultimately achievable by FFTs may differ based on FFT “flavor” and thus memory access pattern (e.g. power of two vs composite), which may interact with details of how the physical memory subsystem is arranged (with dependencies on GPU architecture and possibly DRAM type). The interaction of core clocks with memory throughput, and the influence of boost clocks may further complicate matters.

So for a particular variant of FFT, on a particular GPU architecture, with a particular kind of DRAM, I would expect a more or less linear correlation between FFT throughput and specified memory bandwidth.
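A crude model along those lines is just data volume divided by achievable bandwidth; everything here (pass count, achieved bandwidth, sizes) is an illustrative assumption, not a measurement:

```python
def fft_time_estimate_ms(n, batch, bandwidth_gbps, passes=2, bytes_per_elem=8):
    """Rough lower bound on batched C2C FFT time, assuming the transform
    is purely bandwidth limited: each pass reads and writes the whole
    data set once. The pass count and achieved bandwidth are guesses."""
    bytes_total = passes * 2 * n * batch * bytes_per_elem  # read + write per pass
    return bytes_total / (bandwidth_gbps * 1e9) * 1e3

# Hypothetical example: a batch of 7812 transforms of length 1024,
# at an assumed 250 GB/s achieved bandwidth.
print(f"{fft_time_estimate_ms(1024, 7812, 250):.3f} ms")  # ~1.024 ms
```

Under this model, throughput scales directly with bandwidth, which is what a linear correlation between FFT performance and memory bandwidth would look like.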

Thanks. Your hypothesis seems plausible.

Any time I mix different GPUs in a multi-GPU implementation load balancing is always an issue, so in this case the GTX 1080 is going to get a larger percentage of the work.
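One simple way to do that balancing is to split batches in proportion to per-GPU throughput measured in a calibration run; a sketch with made-up throughput numbers:

```python
def split_batches(total_batches, throughputs):
    """Split a batch count across GPUs in proportion to their measured
    throughput (e.g. batches/second from a calibration run), with the
    integer remainder going to the fastest device."""
    total_tp = sum(throughputs)
    shares = [int(total_batches * tp / total_tp) for tp in throughputs]
    shares[shares.index(max(shares))] += total_batches - sum(shares)
    return shares

# Illustrative: GTX 1080 ~30% faster than the Titan X on this kernel.
print(split_batches(7812, [1.30, 1.00]))  # 1080 gets the larger share
```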

Comparing the memory-relevant statistics from the CUDA profiler for these two GPUs might be enlightening. It’d be interesting to see how the GTX 1080 Ti compares on this computation.

Fully compute-bound apps generally need O(n²) or O(n³) compute intensity.

FFT is O(n log n).
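To put numbers on that asymptotic argument: an n×n matrix multiply does O(n³) flops over O(n²) data, so its flops-per-byte grows linearly with n, while the FFT's grows only logarithmically. A quick comparison using the conventional 2n³ and 5n·log2(n) flop counts:

```python
import math

def matmul_intensity(n):
    # 2*n^3 flops over 3*n^2 single-precision (4-byte) matrix elements
    return (2 * n**3) / (3 * n**2 * 4)

def fft_intensity(n):
    # ~5*n*log2(n) flops over 2*n complex64 (8-byte) elements (in + out)
    return (5 * n * math.log2(n)) / (2 * n * 8)

for n in (2**10, 2**20):
    print(f"n = 2^{int(math.log2(n))}: matmul ~{matmul_intensity(n):,.0f} "
          f"flops/byte, FFT ~{fft_intensity(n):.1f} flops/byte")
```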


As njuffa states, simplistic treatments like this don’t necessarily reflect implementation specifics, but my experience is similar: for large FFTs in CUFFT, execution time is strongly linearly correlated with FFT size. That is to say, in my experience, CUFFT execution time is roughly proportional to FFT size.

This is true (in my experience) for both C2C and R2C transforms. Since R2C transforms move less data, they are faster than the equivalent transform performed as a C2C (i.e. with the real input cast to complex).
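The data-volume point is easy to verify: for a length-n real input, the non-redundant half of the spectrum is n/2 + 1 complex bins, about half the data of the full C2C result. NumPy uses the same half-spectrum convention that cuFFT's R2C does:

```python
import numpy as np

n = 1024
x = np.random.rand(n).astype(np.float32)

full = np.fft.fft(x)    # C2C view: n complex bins
half = np.fft.rfft(x)   # R2C: only the non-redundant half

print(len(full), len(half))  # 1024 513

# The discarded half is just the conjugate mirror of the first half.
assert np.allclose(full[:n // 2 + 1], half)
```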