I’m working on a program that uses CUFFT to do a 3D FFT. According to cudaprof, the FFT kernels take about 12% of my total GPU time. That is a bit more than I expected (when the same algorithm is implemented on a CPU, the FFT takes only a few percent of the computation time), but not too bad. However, optimizing my other calculations hasn’t sped up the program as much as the cudaprof figures suggest it should, and when I time the different parts of the program directly, it’s clear that the FFT takes much more than that: at least 25% of the total computation time, maybe more.

Looking at the profile more carefully reveals the explanation: a single call to cufftExecC2C() involves launching hundreds of kernels. Going through the detailed profile, I see screen after screen full of lines similar to “c2c_transpose, GPU Time 7, CPU Time 70”. So although the total GPU time is not that much, the overhead of launching all those kernels is completely destroying the performance.
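
To put rough numbers on this, here is a minimal sketch in plain C that totals the two profiler columns over a set of kernel rows. The `KernelSample` struct and the function names are hypothetical and not part of CUFFT or cudaprof. It assumes both columns use the same time unit (microseconds, as far as I know) and that the CPU Time column approximates what each launch costs as seen from the host.

```c
#include <stddef.h>

/* Hypothetical record for one profiler row. Both times are assumed
   to be in the same unit (microseconds here). */
typedef struct {
    double gpu_time_us;  /* time the kernel spent executing on the GPU */
    double cpu_time_us;  /* host-side time attributed to the launch */
} KernelSample;

/* Sum of GPU-side times: the quantity behind a "percent of GPU time" figure. */
double total_gpu_time_us(const KernelSample *samples, size_t count)
{
    double total = 0.0;
    for (size_t i = 0; i < count; ++i)
        total += samples[i].gpu_time_us;
    return total;
}

/* Sum of host-side times: closer to what a wall-clock timer around the call sees. */
double total_cpu_time_us(const KernelSample *samples, size_t count)
{
    double total = 0.0;
    for (size_t i = 0; i < count; ++i)
        total += samples[i].cpu_time_us;
    return total;
}

/* Ratio of host-side to GPU-side time; returns 0 when there is no GPU time. */
double host_to_gpu_ratio(const KernelSample *samples, size_t count)
{
    double gpu = total_gpu_time_us(samples, count);
    if (gpu == 0.0)
        return 0.0;
    return total_cpu_time_us(samples, count) / gpu;
}
```

For example, 300 rows like the one quoted above (GPU Time 7, CPU Time 70; the count of 300 is just an illustrative guess at "hundreds") give a GPU total of 2100 and a host total of 21000, a ratio of 10. Under these assumptions, a figure based on GPU time alone would understate the real cost of the call by about that factor.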

Surely there must be a better way of implementing the FFT that does more work in each kernel call? Does anyone know of alternative FFT implementations, or even papers describing better algorithms?

Peter