I have had to ‘roll my own’ FFT implementation in CUDA in the past, then switched to the cuFFT library as the input sizes increased.

Profiling a multi-GPU implementation of a large batched convolution, I noticed that the Pascal GTX 1080 was about 23% faster than the Maxwell GTX Titan X for R2C and C2R calls of the same size and configuration.

What I have heard from ‘the internet’ is that the FFT algorithm is ‘memory bound’, which surprised me: having written my own implementation, I know there is quite a bit of math involved.

The Maxwell GTX Titan X has higher overall global memory bandwidth than the GTX 1080, and on ‘memory bound’ algorithms it generally comes out ahead.
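Here is a back-of-envelope roofline comparison of the kind I had in mind. It is only a sketch: the 5·N·log2(N) flop count is the textbook radix-2 estimate, the traffic model optimistically assumes a single read and write of the data, and the peak TFLOPS/bandwidth figures are taken from the published datasheets.

```python
import math

def fft_arithmetic_intensity(n):
    """Idealized flops-per-byte for a complex64 FFT of size n."""
    flops = 5 * n * math.log2(n)   # classic radix-2 flop estimate
    bytes_moved = 2 * n * 8        # one read + one write, 8 bytes per complex64
    return flops / bytes_moved

# Published peak specs (datasheet numbers, treated as assumptions here)
gpus = {
    "GTX 1080 (Pascal)":     {"tflops": 8.9, "gbps": 320.0},
    "GTX Titan X (Maxwell)": {"tflops": 6.7, "gbps": 336.0},
}

n = 1 << 20  # a 1M-point transform
ai = fft_arithmetic_intensity(n)
print(f"FFT arithmetic intensity at N={n}: {ai:.2f} flops/byte")

for name, spec in gpus.items():
    # Roofline "ridge point": flops/byte at which compute and bandwidth balance
    balance = spec["tflops"] * 1e12 / (spec["gbps"] * 1e9)
    bound = "memory" if ai < balance else "compute"
    print(f"{name}: balance {balance:.1f} flops/byte -> {bound} bound")
```

Even with this generous single-pass traffic model, the intensity comes out well below both cards' balance points (a real multi-pass FFT moves the data more than once, which only lowers it further), so by this rough measure the transform lands on the bandwidth-limited side of the roofline for both GPUs.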

Not complaining, just wondering why the FFT algorithm is classified as ‘memory bound’.

Am I missing something here?