Best FFT library for Fermi architecture: what do you use for best performance?

I need the best possible performance for large 3D FFTs. I only use power-of-2 sizes, specifically 512^3 and 256^3. Quite a few people report results that are much better than CUFFT, but they generally do not make their source code available. Besides CUFFT and Nukada FFT, I found the following references:

Li, X., & Siegel, J. (n.d.). An Empirically Tuned 2D and 3D FFT Library on CUDA GPU. Tensor, 305-314.
Volkov, V., & Kazian, B. (2008). Fitting FFT onto the G80 Architecture. University of California Berkeley, 6.
Govindaraju, N. K., Lloyd, B., Dotsenko, Y., Smith, B., & Manferdelli, J. (n.d.). High Performance Discrete Fourier Transforms on Graphics Processors.

Does anyone have advice on which approach is better, or am I missing something and there are more FFT libraries available?

What kind of GPU are you using at the moment?
By the way, do you need to work in single or double precision?
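Precision matters here partly because of memory footprint. As a rough sketch (the helper name `fft3d_bytes` is made up for illustration, and it assumes a full complex-to-complex transform with no padding), the size of one data buffer can be estimated like this:

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for one n x n x n array of complex values.
// bytes_per_real is 4 for single precision, 8 for double precision.
// Hypothetical helper for illustration; assumes no padding.
std::uint64_t fft3d_bytes(std::uint64_t n, std::uint64_t bytes_per_real) {
    assert(n > 0 && (bytes_per_real == 4 || bytes_per_real == 8));
    return n * n * n * 2 * bytes_per_real;  // 2 reals per complex value
}
```

For 512^3 this gives 1 GiB per buffer in single precision and 2 GiB in double, before counting any work area the library allocates or a second buffer for an out-of-place transform, so it is worth checking against the memory on your card.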

I think Volkov’s code is part of CUFFT now, btw.

Can anyone verify that the Volkov & Kazian code is used by the current CUFFT? We recently wrote a new FFT implementation that vastly outperforms the current CUFFT, but how do I tell whether we outperform Volkov & Kazian or are merely in their league?

PS: I found some answers here:

https://devtalk.nvidia.com/default/topic/392645/cuda-programming-and-performance/my-speedy-fft-3x-faster-than-cufft/
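
On the question of how to tell whether an implementation is in the same league as published results without having their source: one common approach is to convert your own timings into nominal GFLOP/s using the 5 N log2(N) operation count that FFT benchmarks conventionally use for complex transforms, and compare that with the figures reported in the papers. A minimal sketch, where `fft3d_gflops` is a hypothetical helper name and the comparison only makes sense if size, precision, and what is timed (kernel only vs. including transfers) match the published setup:

```cpp
#include <cassert>
#include <cmath>

// Nominal GFLOP/s for one complex-to-complex 3D FFT of size n x n x n,
// using the conventional 5 * N * log2(N) operation count with N = n^3.
// This is a normalized figure for comparison, not an exact flop count.
double fft3d_gflops(double n, double seconds) {
    assert(n > 1 && seconds > 0);
    const double total = n * n * n;                        // N = n^3 points
    const double flops = 5.0 * total * std::log2(total);   // nominal operations
    return flops / seconds * 1e-9;
}
```

For example, a 512^3 transform that takes 0.1 s works out to roughly 181 nominal GFLOP/s. Because the count is a convention rather than the real number of operations, it is only useful for comparing implementations on the same problem.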