Writing custom FFT for sizes other than powers of 2

I am dealing with an image of 20x20 pixels and I am trying to implement my algorithm that needs to do FFT-based correlation on 1700 images 2000 times a second.

I am currently using CUFFT and will also try using the Nakura FFT library posted here. However, I would like some feedback whether it’s worthwhile writing a custom FFT for the size I want (400-point transform)? I am trying to determine how much performance improvement it will give over the general purpose CUFFT library?

Thanks in advance for all the suggestions

I guessing like me, you found CUFFT to have poorer than expected performance (e.g. 100 gflops), especially for the real to complex & complex to real transforms.

NVIDIA made some improvements for those cases in CUDA 3.1 - not sure if they use a specialized real to complex version or use the 2 real FFTs for the price of 1 complex FFT technique.

According to the CUDA 3.2 RC release notes, NVIDIA also added support for 3, 5, 7 radices to CUFFT, so a 20 point FFT (both 1D & 2D) should have all prime factors handled by CUDA 3.2’s CUFFT. You can try experimenting.

You can always try to implement your own FFT and see all the creative divide & conquer methods & practical optimizations (e.g. in-place methods) like me, but that will be a lot of work. You can start by looking at Vasily Volkov’s work, which I suspect was actually used in CUFFT. There’s also the ubiquitous FFTW library that documents the best practical methods.

The one FFT that I’m not sure if anyone has done efficiently in CUDA is a 2D FFT specialized for real valued data like you want. I started implementing my own vector-radix version a few months ago, but put it on hold after CUFFT greatly improved performance.

I guessing like me, you found CUFFT to have poorer than expected performance (e.g. 100 gflops), especially for the real to complex & complex to real transforms.

NVIDIA made some improvements for those cases in CUDA 3.1 - not sure if they use a specialized real to complex version or use the 2 real FFTs for the price of 1 complex FFT technique.

According to the CUDA 3.2 RC release notes, NVIDIA also added support for 3, 5, 7 radices to CUFFT, so a 20 point FFT (both 1D & 2D) should have all prime factors handled by CUDA 3.2’s CUFFT. You can try experimenting.

You can always try to implement your own FFT and see all the creative divide & conquer methods & practical optimizations (e.g. in-place methods) like me, but that will be a lot of work. You can start by looking at Vasily Volkov’s work, which I suspect was actually used in CUFFT. There’s also the ubiquitous FFTW library that documents the best practical methods.

The one FFT that I’m not sure if anyone has done efficiently in CUDA is a 2D FFT specialized for real valued data like you want. I started implementing my own vector-radix version a few months ago, but put it on hold after CUFFT greatly improved performance.