It’s written that CUFFT library supports algorithms that higly optimized for input sizes can be written in the folowing form: 2^a X 3^b X 5^c X 7^d.
How could they managed to do that?
For as far as I know, FFT must provide best perfomance only for 2^a input size.
While sizes that are powers of two typically show the best performance, it is possible to achieve similar performance for powers of other small prime factor. Slide 8 of the following slide deck gives an overview for CUDA 6.5:
Mixed-radix FFTs for sizes that are products of the powers of small prime factors have been extensively researched since the 1970s.