The documentation suggests that the array to FFT can be any size. I assume that the FFT implementation is not done in a completely general way for performance reasons.
Do certain sizes give better performance than others? For example, does a power of 2 do better than a prime number? Does a power of 3 do better than a power of 2?
Even based on theoretical bounds (regardless of implementation), any size that can be factored into a product of small primes with give better performance than a single prime.
In general with CUFFT, you will get the best performance for radix-n sizes, where n is 4,2,3,5,7, in that order. That is, radix-4 is fastest, 2 faster than 3, and so on.