I use FFTs on x86 for sizes that are mixed powers of 3, 5 and 7, but not powers of 2. I read in an old (2008) benchmark that CUFFT is not much faster than x86 for non-powers of two. Is there a newer benchmark comparing CUFFT to x86 for non-powers of two?
Neither of these is a direct answer to your question, but they may be of interest:
The most recently published CUDA performance report (6.5) is here (cufft data on slides 7-9):
http://developer.download.nvidia.com/compute/cuda/6_5/rel/docs/CUDA_6.5_Performance_Report.pdf
And in CUDA 7, performance for transform sizes that are composite powers of 2, 3, 5, or 7 has been significantly improved:
http://devblogs.nvidia.com/parallelforall/cuda-7-release-candidate-feature-overview/
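If it helps when picking sizes to benchmark, here is a minimal sketch for checking whether a transform length falls into that category. It assumes "composite powers of 2, 3, 5, or 7" means lengths whose only prime factors are 2, 3, 5 and 7; the function names are hypothetical helpers, not part of cufft or any other library.

```python
# Assumption: a "composite power of 2, 3, 5, or 7" is a length of the form
# 2^a * 3^b * 5^c * 7^d. These helpers are illustrative only.

def is_7_smooth(n: int) -> bool:
    """Return True if n factors entirely into the primes 2, 3, 5 and 7."""
    if n < 1:
        return False
    for p in (2, 3, 5, 7):
        # Divide out every factor of p; whatever remains has larger primes.
        while n % p == 0:
            n //= p
    return n == 1


def next_7_smooth(n: int) -> int:
    """Return the smallest 7-smooth length that is >= n."""
    candidate = max(n, 1)
    while not is_7_smooth(candidate):
        candidate += 1
    return candidate
```

For example, `is_7_smooth(105)` is true because 105 = 3 * 5 * 7, while a length of 1001 = 7 * 11 * 13 is not, and `next_7_smooth(1001)` returns 1008 = 2^4 * 3^2 * 7 as a candidate padded size.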
Thanks. That gives me enough reason to convert my x86 code to cufft and run some benchmarks myself.
If you have any benchmark data for non-power-of-two FFTs on modern Haswell-class Xeons, I'd love to see it. I spent 30 minutes searching the internet and came up empty-handed.