I have had to ‘roll my own’ FFT implementation in CUDA in the past, but switched to the cuFFT library as my input sizes increased.
Profiling a multi-GPU implementation of a large batched convolution, I noticed that the Pascal GTX 1080 was about 23% faster than the Maxwell GTX Titan X for the same R2C and C2R calls of the same size and configuration.
What I have heard from ‘the internet’ is that FFT implementations are ‘memory bound’, which surprised me, since I have written my own and there is quite a bit of math involved.
The Maxwell GTX Titan X has higher overall global memory bandwidth, and for ‘memory bound’ algorithms it generally outperforms the GTX 1080.
Not complaining, just wondering why the FFT algorithm is classified as ‘memory bound’.
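For context, here is my rough back-of-envelope estimate of the arithmetic intensity (flops per byte) of an FFT. The ~5 N log2 N flop count is the usual radix-2 estimate, and I'm optimistically assuming the signal is read and written only once, so the real numbers for a multi-pass kernel would be even lower:

```python
import math

def fft_arithmetic_intensity(n, bytes_per_element=8):
    """Rough flops-per-byte estimate for one out-of-place
    complex single-precision FFT of length n.

    Assumes ~5 * n * log2(n) flops (standard radix-2 estimate)
    and that the data is read once and written once, i.e.
    perfect on-chip reuse between butterfly stages (optimistic)."""
    flops = 5 * n * math.log2(n)
    bytes_moved = 2 * n * bytes_per_element  # one read + one write
    return flops / bytes_moved

# Approximate balance points (peak FP32 flops / peak bandwidth):
#   GTX 1080:          ~8900 GFLOP/s / ~320 GB/s  -> ~28 flop/byte
#   Titan X (Maxwell): ~6100 GFLOP/s / ~336 GB/s  -> ~18 flop/byte
for n in (2**10, 2**20):
    print(f"N = {n:>8}: {fft_arithmetic_intensity(n):.2f} flop/byte")
```

If this estimate is anywhere near right, even a single-pass million-point FFT sits at only a few flops per byte, far below either card's flops-to-bandwidth ratio, which would make it memory bound despite all the math. But I may be mis-counting the traffic.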
Am I missing something here?