cuFFT power-of-2 performance

I notice that any non-power-of-2 FFT size dramatically reduces the speed of the FFT, so I would like to use a power-of-2 size for best performance, for example 2^19 instead of 500000. Do the cufftPlan functions have any automatic padding option?
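On the padding question: as far as I know, the cuFFT plan functions take the transform size exactly as given and do not pad for you, so any zero padding has to be done by the caller before the data is copied to the device. Below is a minimal host-side sketch of that step, assuming single-precision complex input. The helper names next_pow2 and pad_to_pow2 are made up for illustration and are not part of cuFFT.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <vector>

// Smallest power of two that is >= n (returns 1 for n == 0 or n == 1).
std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

// Copy the signal into a buffer whose length is a power of two.
// The extra tail elements are value-initialised, i.e. zero.
std::vector<std::complex<float>> pad_to_pow2(
        const std::vector<std::complex<float>>& signal) {
    std::vector<std::complex<float>> padded(next_pow2(signal.size()));
    std::copy(signal.begin(), signal.end(), padded.begin());
    return padded;
}
```

For a 500000-point signal this gives a 524288-point (2^19) buffer, which would then be passed to a plan created for that size. Keep in mind that zero padding changes the frequency spacing of the output bins, so the result is not identical to a 500000-point transform.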

Is there any documentation on benchmarks for different FFT sizes?

http://developer.download.nvidia.com/compute/cuda/compute-docs/cuda-performance-report.pdf
Slide 4

Thanks for the chart.

Regarding cufftPlanMany: if my array size n is 1024, inembed is 1024, and istride is 836, does the FFT pad the rest with zeros, or does it take the full 1024 elements from memory and then take the next set of 1024 elements at an offset of 1024-836, hence overlapping the FFTs?
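For reference on the layout question: the cuFFT documentation describes the 1D advanced data layout as reading element x of batch b from input[b * idist + x * istride]. Under that formula, istride is the spacing between successive elements of one signal and idist is the spacing between the starts of consecutive batches, and nothing is zero padded. The host-only sketch below just evaluates that formula. The names Layout1D, input_index and batch_overlap are hypothetical helpers for illustration, not cuFFT API.

```cpp
#include <cstddef>

// Parameters of a 1D batched layout, named after the cufftPlanMany arguments.
struct Layout1D {
    std::size_t n;        // transform length
    std::size_t istride;  // distance between successive elements of one signal
    std::size_t idist;    // distance between first elements of consecutive batches
};

// Array position read for element x of batch b,
// following the documented formula input[b * idist + x * istride].
std::size_t input_index(const Layout1D& l, std::size_t b, std::size_t x) {
    return b * l.idist + x * l.istride;
}

// Number of array positions by which the extent of one batch
// runs past the start of the next batch (0 if they do not overlap).
std::size_t batch_overlap(const Layout1D& l) {
    std::size_t span = (l.n - 1) * l.istride + 1;  // extent touched by one batch
    return span > l.idist ? span - l.idist : 0;
}
```

With n = 1024 and istride = 836, element 1 of a signal sits at position 836 and the last element at position 855228, so istride = 836 means a strided read, not a shorter signal. If the intent was instead for each new signal to start 836 elements after the previous one, that would be istride = 1 and idist = 836, in which case consecutive 1024-point windows share 1024 - 836 = 188 elements. Whether cuFFT supports overlapping input batches like this is worth checking against the documentation before relying on it.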