I need to get the best possible performance from large 3D FFTs. I only use power-of-2 sizes, specifically 512^3 and 256^3. Quite a few people report results that are much better than CUFFT, but they generally do not make their source code available. Apart from CUFFT and Nukada FFT, I found the following references:
Li, X., & Siegel, J. (n.d.). An Empirically Tuned 2D and 3D FFT Library on CUDA GPU. Tensor, 305-314.
Volkov, V., & Kazian, B. (2008). Fitting FFT onto the G80 Architecture. University of California Berkeley, 6.
Govindaraju, N. K., Lloyd, B., Dotsenko, Y., Smith, B., & Manferdelli, J. (n.d.). High Performance Discrete Fourier Transforms on Graphics Processors.
Does anyone have advice on which approach is better, or am I missing something and there are more FFT libraries available?