Thanks, yes, I’ve tried cuFFT callbacks (CUFFT_CB_LD_REAL) in a real-to-complex FFT. It was noticeably slower than windowing the data first (either by loading pre-computed weights or by computing the weights on the fly) and then FFT’ing the windowed data without a callback. I suspect the slowdown comes from ‘float’-sized loads instead of ‘float4’ loads, since the cuFFT load callback returns a single cufftReal per call.

The expected speedup if data windowing is incorporated implicitly in the twiddle factors depends…

In my case:

- windowing (load data, load/compute coefficient, multiply, store data): 30 gigasamples/second
- cuFFT for my FFT size: 16 gigasamples/second

=> total ~10.8 gigasamples/second for windowed FFT
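
For reference, here is a rough sketch of how such a combined figure can be estimated. It assumes the two passes run strictly back to back with no overlap, so the per-sample times simply add; the function name `combined_throughput` is just made up for illustration. This idealized model lands at roughly 10.4 gigasamples/second, in the same range as the ~10.8 quoted above.

```python
def combined_throughput(*stage_rates):
    """Throughput of stages run one after another (no overlap).

    Each stage processes every sample once, so the time per sample is the
    sum of the per-stage times, i.e. the sum of the reciprocal rates.
    """
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


windowing_rate = 30.0  # gigasamples/second, figure from the post
fft_rate = 16.0        # gigasamples/second, figure from the post

total = combined_throughput(windowing_rate, fft_rate)
speedup = fft_rate / total  # gain if the windowing pass disappeared entirely

print(f"{total:.1f} gigasamples/second, potential speedup {speedup:.2f}x")
# prints "10.4 gigasamples/second, potential speedup 1.53x"
```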

Performing the windowing implicitly via modified FFT twiddle factors eliminates the extra O(N) pass of data loads, multiplications, and stores. Throughput should then essentially match that of the plain FFT, i.e. 16 gigasamples/second, or about 1.5x faster than doing windowing + FFT explicitly.
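
To illustrate the underlying identity, here is a minimal sketch using a naive O(N^2) DFT in plain Python rather than an actual FFT. The Hann window, the test signal, and all function names are assumptions chosen for the example. It only shows that multiplying the window into the transform coefficients gives the same result as windowing the data first; it says nothing about how cuFFT or FFTW organize their twiddle factors internally.

```python
import cmath
import math


def hann(n_points):
    """Periodic Hann window, used here only as an example window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points)
            for n in range(n_points)]


def dft_kernel(n_points):
    """Plain DFT coefficients W[k][n] = exp(-2*pi*i*k*n/N)."""
    return [[cmath.exp(-2j * math.pi * k * n / n_points)
             for n in range(n_points)]
            for k in range(n_points)]


def windowed_kernel(n_points, window):
    """DFT coefficients with the window folded in: w[n] * W[k][n]."""
    base = dft_kernel(n_points)
    return [[window[n] * base[k][n] for n in range(n_points)]
            for k in range(n_points)]


def apply_kernel(kernel, x):
    """Naive O(N^2) transform: X[k] = sum_n kernel[k][n] * x[n]."""
    return [sum(row[n] * x[n] for n in range(len(x))) for row in kernel]


N = 8
x = [float((3 * n) % 5) - 2.0 for n in range(N)]  # arbitrary test signal
w = hann(N)

# Explicit: separate windowing pass, then the unmodified transform.
explicit = apply_kernel(dft_kernel(N), [w[n] * x[n] for n in range(N)])

# Implicit: untouched data, window absorbed into the coefficients.
implicit = apply_kernel(windowed_kernel(N, w), x)

max_err = max(abs(a - b) for a, b in zip(explicit, implicit))
print(max_err < 1e-12)
```

In a real FFT the window would have to be absorbed into whichever stage first touches each input sample, which is exactly the part the libraries do not currently expose.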

User-modifiable twiddle factors would also allow something called “Generalized DFT”.

I’ve looked at FFTW, but it doesn’t offer direct means to update twiddle factors.

Perhaps I’ll file an RFE as you suggest, for a new feature (in both FFTW and cuFFT)!