cuFFT performance loss at FFT sizes of 16k and above?


I am currently working on a project that needs to perform quite large FFTs. The cuFFT library provides a lot of useful functionality for this.

I have now done some performance measurements. The small FFTs run quite fast (as expected on that kind of architecture), but at exactly 16k FFT points there is a big performance loss. The profiler output shows that the 16k FFT is separated into more than 1000 kernels. If I sum up the GPU timings it doesn't look that bad, but I also measured the wall-clock time around the whole procedure, and there the overhead is more than 3 times the complete FFT timing itself.
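To make the two measurements concrete, here is a minimal sketch of the wall-clock side, not my exact measurement code. The names `wall_time_ms` and `overhead_ratio` are made up for illustration. For a GPU measurement the callable passed in would have to wrap the cuFFT execution call followed by a synchronization such as `cudaDeviceSynchronize()`, otherwise the clock stops before the kernels have finished.

```cpp
#include <chrono>

// Hypothetical helper: average wall-clock time in milliseconds of one call
// to `work`, measured over `repetitions` back-to-back calls.
// Precondition: repetitions > 0. For GPU work, `work` must block until the
// device has finished (e.g. by synchronizing at its end).
template <typename F>
double wall_time_ms(F&& work, int repetitions) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < repetitions; ++i) {
        work();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / repetitions;
}

// Hypothetical helper: overhead expressed as a multiple of the summed GPU
// kernel time reported by the profiler.
// Precondition: gpu_ms > 0.
double overhead_ratio(double wall_ms, double gpu_ms) {
    return (wall_ms - gpu_ms) / gpu_ms;
}
```

With purely illustrative numbers, a wall-clock time of 4.0 ms against 1.0 ms of summed kernel time gives `overhead_ratio(4.0, 1.0) == 3.0`, which is the "3 times the FFT timing" situation described above.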

I think the overhead is tied to the number of kernel calls: when I perform a 32k FFT, the number of kernels halves and the overhead is about 1/3 of the 16k case, and the numbers keep falling as the FFT length increases.

My guess is that the FFT planner can't produce a better plan for the 16k FFT. It seems to use mostly (if not only) radix-2 stages, which are slower and of course have to be called more often than radix-4 stages.
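The "called more often" part of this guess can be sketched with a small pass counter. This only models the textbook radix argument for power-of-two lengths; `fft_pass_count` is a made-up name, and the function says nothing about how cuFFT actually builds its plans or schedules its kernels, which I don't know.

```cpp
#include <cstdint>

// Hypothetical helper: number of butterfly passes a length-n FFT needs when
// every pass uses the given radix. Both n and radix must be powers of two
// (radix >= 2). If log2(n) is not a multiple of log2(radix), one extra
// smaller pass is assumed, hence the rounding up.
// Returns -1 for invalid input.
int fft_pass_count(std::uint64_t n, std::uint64_t radix) {
    auto is_pow2 = [](std::uint64_t x) { return x != 0 && (x & (x - 1)) == 0; };
    if (!is_pow2(n) || !is_pow2(radix) || radix < 2) {
        return -1;
    }
    auto log2u = [](std::uint64_t x) {
        int bits = 0;
        while (x > 1) {
            x >>= 1;
            ++bits;
        }
        return bits;
    };
    const int n_bits = log2u(n);
    const int radix_bits = log2u(radix);
    return (n_bits + radix_bits - 1) / radix_bits;  // ceiling division
}
```

For 16k points (16384 = 2^14) this gives 14 passes with radix 2 and 7 passes with radix 4. Either number is far below the 1000+ kernels seen in the profiler, so the radix choice alone would not account for the kernel count.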

Am I understanding this correctly?
Does anyone have a suggestion for how to speed things up? For my project this means a performance loss of about 3/4 at 16k FFT points.