cuFFT plan twiddle factors?

Hi, has anyone by any chance reverse-engineered the layout of cuFFT plans?

The context of this question is FFT windowing (Hamming, Hann, other). Windowing would be quite efficient if the coefficients of the window function could be incorporated directly in the FFT twiddle factor matrix. The cuFFT plan memory location and layout for cuFFT twiddle factors would need to be known, though, so that the twiddle factors can be changed to more suitable ones. Anyone had any luck trying to figure out this for cuFFT?


This is not my area of expertise, but it seems to me that you should be able to use the CUFFT callback feature to implement windowing. Is that not the case? How much of a speedup would you anticipate from a direct manipulation of the twiddle factors?

I know nothing about the internals of CUFFT plans, so I wonder whether one can even assume that it incorporates a “twiddle factor matrix”. Reverse engineering such internals may be fun, but since the internal structure of plans could (and probably does) change with every release, this seems like a very fragile approach from a software engineering perspective.

It may make more sense to research what general mechanisms other plan-based FFT implementations (in particular: FFTW) offer to support windowing, and then file an RFE (request for enhancement) with NVIDIA for a similar mechanism for CUFFT.

Thanks, yes, I’ve tried cuFFT Callbacks (CUFFT_CB_LD_REAL) in real-to-complex FFT. It was noticeably slower than just windowing the data first (either by loading pre-computed weights, or by computing weights on the fly) and then FFT’ing the windowed data, without a callback. I suspect the slowdown may be because of ‘float’-sized loads instead of ‘float4’ loads; the cuFFT load callback output should be a single cufftReal.

The expected speedup if data windowing is incorporated implicitly in the twiddle factors depends…

In my case:

  • windowing (load data, load/compute coefficient, multiply, store data) : 30 gigasamples/sec
  • cuFFT for my FFT size : 16 gigasamples/second
    => total ~10.8 gigasamples/second for windowed FFT

Performing the windowing implicitly via manipulated FFT twiddle factors omits the extra data loads, multiplications, and stores (O(N)). Throughput should be exactly as fast as the FFT. That is, 16 Gs/s, so about x1.5 times faster than explicitly doing windowing+FFT.

Updated twiddle factors would also allow something called “Generalized DFT”.

I’ve looked at FFTW, but it doesn’t offer direct means to update twiddle factors.

Perhas I’ll file a RFE as you suggest, for a new feature (both FFTW and cuFFT)!

I wonder what a general-purpose implementation-independent interface for FFT twiddle-factor manipulation might look like. Consider that twiddle factors might be computed on the fly rather than pre-computed and stored (I don’t claim any knowledge how either CUFFT or FFTW manage twiddle factors).

It might be a good idea to ponder this down to some detail before filing those RFEs, because FFTs and windowing are a common enough scenario that “if this were easy, somebody would have done it by now”.