On a G92, for a complex-to-complex forward transform followed by a backward transform (that is, you get your original image back) of a 2048x2048 grayscale image, I measured around 100 milliseconds (0.1 seconds), including interleaving/deinterleaving (Re + Im <-> complex) and Host<->GPU data transfers.
For comparison, FFTW takes around 0.3 seconds (3x the time) for the same task, and that’s with a 3.0 GHz quad-core using the Single Precision SSE version of FFTW, multi-threaded (NThreads = 4).
So I’d say CUFFT is pretty fast in this situation.
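For anyone wondering what the interleaving/deinterleaving step included in the timing looks like, here is a minimal host-side sketch in C. It assumes the image is held as two separate float planes (Re and Im) and that the FFT library wants one buffer of (re, im) pairs, which is the layout both fftwf_complex and cufftComplex use. The names complex_f32, interleave and deinterleave are made up for this example and are not part of either library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type for this sketch: one single-precision complex sample,
   real part first, imaginary part second (same layout as a float[2] pair). */
typedef struct { float re, im; } complex_f32;

/* Pack separate real and imaginary planes into one interleaved buffer. */
void interleave(const float *re, const float *im, complex_f32 *out, size_t n)
{
    assert(re && im && out);
    for (size_t i = 0; i < n; ++i) {
        out[i].re = re[i];
        out[i].im = im[i];
    }
}

/* Unpack an interleaved buffer back into separate real and imaginary planes. */
void deinterleave(const complex_f32 *in, float *re, float *im, size_t n)
{
    assert(in && re && im);
    for (size_t i = 0; i < n; ++i) {
        re[i] = in[i].re;
        im[i] = in[i].im;
    }
}
```

For a 2048x2048 image, n would be 2048 * 2048, with interleave called before the upload to the GPU and deinterleave after the download.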
Please note that I can’t try 5Kx5K, since on my G92 512MB I can only go as far as 3Kx3K or something around that (I think a 1.5GB setup should do 5Kx5K).
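As a rough guide to the memory side, the arithmetic for one interleaved single-precision complex buffer can be sketched as below. The helper name complex_buffer_bytes is made up for this example, and "5Kx5K" is taken to mean 5000x5000, which is an assumption. The figure covers a single buffer only; any scratch space the FFT library allocates and any additional input/output buffers come on top, and those are not estimated here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed for one rows x cols buffer of
   interleaved single-precision complex samples (two floats per pixel). */
uint64_t complex_buffer_bytes(uint64_t rows, uint64_t cols)
{
    return rows * cols * 2u * (uint64_t)sizeof(float);
}
```

With 4-byte floats this gives 33,554,432 bytes (32 MiB) for 2048x2048, 72,000,000 bytes for 3000x3000, and 200,000,000 bytes for 5000x5000.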
Your mileage may vary if your image's row and/or column count is not a multiple of the number of Streaming Multiprocessors your GPU has (because of uncoalesced memory accesses and/or bank conflicts, if I recall correctly).
More benchmarks here: