FFT CUDA implementation


In my code, I need a 1D FFT that runs efficiently on the GPU. Where can I find such an implementation? Is the source code of the cuFFT library available, for example?

I want to run the FFT and other operations in the same kernel, but cuFFT library functions can't be called from inside a kernel, so I figured I would have to implement the FFT myself. Is there a better solution?

One possible approach is to end your pre-processing kernel, call the FFT from cuFFT on the host, and then launch a new kernel to do whatever post-processing is needed.
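A minimal sketch of that three-step approach, in case it helps. The `preprocess` and `postprocess` kernels and the `run_pipeline` helper are hypothetical placeholders (a simple gain and a 1/n normalization), not anything from your code; only the cuFFT calls are the real library API. Everything is issued on the default stream, so the three steps run in order.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>
#include <vector>

// Hypothetical pre-processing: scale every sample by `gain`.
__global__ void preprocess(cufftComplex* data, int n, float gain) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { data[i].x *= gain; data[i].y *= gain; }
}

// Hypothetical post-processing: normalize the spectrum by 1/n.
__global__ void postprocess(cufftComplex* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { data[i].x /= n; data[i].y /= n; }
}

// Runs kernel -> cuFFT -> kernel in place on `signal`.
// Returns true if the cuFFT call and the final synchronize succeeded.
bool run_pipeline(std::vector<cufftComplex>& signal, float gain) {
    const int n = static_cast<int>(signal.size());
    const size_t bytes = n * sizeof(cufftComplex);

    cufftComplex* d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;
    cudaMemcpy(d_data, signal.data(), bytes, cudaMemcpyHostToDevice);

    cufftHandle plan;
    if (cufftPlan1d(&plan, n, CUFFT_C2C, 1) != CUFFT_SUCCESS) {
        cudaFree(d_data);
        return false;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    preprocess<<<blocks, threads>>>(d_data, n, gain);            // step 1
    bool ok = cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD)  // step 2
              == CUFFT_SUCCESS;
    postprocess<<<blocks, threads>>>(d_data, n);                 // step 3
    ok = ok && (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(signal.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(d_data);
    return ok;
}
```

Build with something like `nvcc pipeline.cu -lcufft`. The data stays in device memory between the three steps, so the cost of splitting is the extra kernel launches and the round trips through global memory, not host transfers.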

Thank you for your answer.
My motivation for running the FFT and the other operations in the same kernel is to avoid L1 cache thrashing. All the operations use the same data, so splitting the work across multiple kernels would decrease performance.

You might be interested in cuFFTDx, the cuFFT device extensions, which let you run an FFT inside your own kernel.
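Here is a rough sketch of what a fused kernel could look like with cuFFTDx, following the pattern of its introductory block FFT example. The FFT size (128), the target architecture (`SM<800>`, which has to match the GPU you compile for), and the gain and 1/n steps are all assumptions for illustration; `fused_kernel` and `run_fused` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <cufftdx.hpp>
#include <vector>

// Assumed FFT description: 128-point, single precision, complex-to-complex,
// forward, one FFT per thread block, compiled for SM 8.0.
using FFT = decltype(cufftdx::Size<128>() + cufftdx::Precision<float>() +
                     cufftdx::Type<cufftdx::fft_type::c2c>() +
                     cufftdx::Direction<cufftdx::fft_direction::forward>() +
                     cufftdx::SM<800>() + cufftdx::Block());

__launch_bounds__(FFT::max_threads_per_block)
__global__ void fused_kernel(typename FFT::value_type* data, float gain) {
    using complex_type = typename FFT::value_type;
    constexpr unsigned int n = cufftdx::size_of<FFT>::value;
    constexpr unsigned int stride = n / FFT::elements_per_thread;

    complex_type thread_data[FFT::storage_size];  // this thread's samples
    const unsigned int base =
        n * (blockIdx.x * FFT::ffts_per_block + threadIdx.y) + threadIdx.x;

    // Load and pre-process in registers (hypothetical gain step).
    for (unsigned int i = 0; i < FFT::elements_per_thread; ++i) {
        complex_type v = data[base + i * stride];
        v.x *= gain;
        v.y *= gain;
        thread_data[i] = v;
    }

    // The FFT itself, executed cooperatively by the thread block.
    extern __shared__ __align__(alignof(float4)) complex_type shared_mem[];
    FFT().execute(thread_data, shared_mem);

    // Post-process in registers (hypothetical 1/n normalization) and store.
    for (unsigned int i = 0; i < FFT::elements_per_thread; ++i) {
        complex_type v = thread_data[i];
        v.x /= n;
        v.y /= n;
        data[base + i * stride] = v;
    }
}

// Transforms `signal` in place; its length must be a multiple of 128.
bool run_fused(std::vector<float2>& signal, float gain) {
    constexpr unsigned int n = cufftdx::size_of<FFT>::value;
    if (signal.empty() || signal.size() % n != 0) return false;
    const unsigned int batches = static_cast<unsigned int>(signal.size() / n);
    const size_t bytes = signal.size() * sizeof(float2);

    typename FFT::value_type* d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;
    cudaMemcpy(d_data, signal.data(), bytes, cudaMemcpyHostToDevice);

    fused_kernel<<<batches, FFT::block_dim, FFT::shared_memory_size>>>(d_data, gain);
    const bool ok = cudaDeviceSynchronize() == cudaSuccess;

    cudaMemcpy(signal.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return ok;
}
```

cuFFTDx is header-only and ships with MathDx, so this needs the MathDx include directory on the compiler's include path and a C++17-capable `nvcc`. The point of the sketch is that the samples stay in registers across the pre-processing, the FFT, and the post-processing, with a single load from and a single store to global memory.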