Using cuFFT callbacks for FFT windowing

Am interested in using cuFFT to implement overlapping 1024-pt FFTs on a 8192-pt input dataset and is windowed (e.g. hanning window). That is, the number of batches would be 8 with 0% overlap (or 12 with 50% overlap). I followed the example in the 2014 Pro-Tip ( and used a “load” callback to multiply the time-domain input by the windowing function. With 0% overlap, everything seems okay, but it doesn’t appear to be windowing quite right when FFTs are overlap (in_distance is less than Nfft). As part of debugging, I displayed the “offset” sent to the callback routine: while the offset values vary between 0 and 8191 (as they should), the number of offsets seems to be limited to 4096. That means that they are gaps in the offset values. Is 4096 some sort of hard limit?

There isn’t any hard limit around 4096.

Thanks for the response. Can you think of any reason why I don’t see all 8192 values show up when I put a simple printf("%d\n",offset) in the Load Callback function?

According to the cuFFT documentation, the load callback routine is called “once and only once” for each point in the input (which in this case is 8192-pts).

GPU-side printf has limitations. You may be running into a buffer size limit.

Good hypothesis - and correct, I think. I limited the number of printfs on the GPU by conditioning it within an “if” statement, and it looks like all 8192 offsets are being sent to the callback. All good, but in debugging this - I don’t think the cuFFT callback method will work for windowing FFTs when the input stride is less than the length of the FFT (i.e. overlapping FFTs) since the callback is called only once for each point in the input. When the input stride is less than the length of the FFT, the callback routine would have to be called more than once. Consider a 50% overlap - for example, let’s say I have my 8192-pt input, my FFT is 1024-pts, and my input stride is 512-pts. The first FFT in the fftplan should be input points [0…1023], multiplied by the 1024-pt windowing function. However, the second FFT in the fftplan should be input points [512…1535] multiplied by the same 1024-pt windowing function. So as you can see, the windowed input for points 512 to 1023 are different, depending on which FFT in the “batch” you are on?

Am I missing something, and there is a way to get the “load callback” to be called for every fft in the fftplan? I am thinking I will need to implement the windowing (and FFTs) one at a time in a grid-stride loop, instead of the efficient MakePlanMany call.

After some thought I would agree that input callbacks for overlapping transforms won’t be readily possible, currently. The reason might be different than what you are stating. I’m not 100% convinced that only one load will be done per data element (I agree only one load will be done per data element per batch item), but I don’t want to argue about that, because there is a more fundamental issue, namely that the data provided to the callback routine (pointer to data, element index) is not sufficient to distinguish which batch ID/element is actually being requested, even if there were 1 load per data element for each item in the batch.

We’re looking at ways to improve CUFFT efficiency generally, and if you’re able to provide more details, we’re interested in your use case.

  1. anything you can share about motivation or science domain that has these overlapping FFTs? What is the problem being solved?
  2. Are the sizes of 8192-pt input, 1024-pt FFT, 512-pt stride relatively common or fixed for your use case? If not, what are typical ranges for these parameters?
  3. Are there other preprocessing steps right before this FFT, or postprocessing steps right after this FFT, that you care to discuss?
  4. Are you doing work on the GPU prior to this FFT? After this FFT?


Sure - happy to answer (especially if CUFFT can be improved)!

Just doing digital signal processing…

  1. I almost always window and overlap the FFTs. Windowing is necessary to prevent artifacts from spreading power across the spectrum. This spectral leakage is inherent due to the fact that the DFT assumes signal infinite periodicity - which is not the case if the signal freq is not a perfect multiple of the frequency resolution. So the Windowing helps reduce the leakage - by introducing a tapering of the input signal over the N points - typically these windowing functions (e.g. Hanning) go to 0 at the boundaries. All good, right? Except for the fact that since the window goes to zero at the boundaries, if there is a signal there, you get an amplitude reduction in the resulting FFT: not good. Hence the overlap - with a 50% overlap, you are now centering an FFT window right over the spot where you were losing signal…so you are now guaranteed to capture any input signal without amplitude loss…regardless of the freuqency.
  2. The sizes I gave are just notional examples: I regularly use a wide variety of 2^N FFT lengths
  3. No other pre-processing steps (data is coming from an ADC). Lots of postprocessing after the FFTs, but not relevant right now.
  4. This is my first venture in GPUs. The plan is to do as much processing and data reduction on the GPU as possible - and then get it to the host.

I hope this helps.

Also, I should probably start a new thread, but since I have your attention:

There may be a bug in the cufftMakePlanMany call for CUFFT_C2C types, regarding the output distance parameter (odist). For CUFFT_R2C types, I can change odist and see a commensurate change in resulting workSize. However, for CUFFT_C2C, it seems that odist has no effect, and the effective odist corresponds to Nfft. It’s possible that I don’t quite understand how workSize works, but the behavior between R2C and CFC is definitely different.


I’m also in a desperate need for a cuFFT implementation of overlapping windowed FFTs. A typical example is an input of about 10000pts, a window of 3072pts, 4096pts FFT, with strides of 3.
One will then also need a zeropad of 1024pts before conversion, although it might be possible to solve that with longer windows with a zero tail.
So in total about 2300 4096pts FFTs per input sequence.

Currently I need to loop over the data to apply the window, which is slow and non-appealing solution. Speed-wise it’s about the same as the CPU case.

This thread has been quiet now for some time, so have there been any developments recently? Or, maybe, there is a solution that I’ve missed?

I am unaware of any development/update to make cuFFT work for overlapped and windowed FFTs. I had assumed that, unfortunately, nothing was done: I was forced to resort to an alternative method - which is/was not nearly as elegant and as simple as the load callback would have been…but at least it is functional. Sorry to be of no help here - wish the NVidia folks could have come through for me.

I’m also interested in having this feature supported. Is there any plan to have this in cuFFT?

Generally speaking, you won’t get answers to forward-looking questions on these forums. I’m not saying it never happens, but there are significant issues in making such statements, so in my experience its rare for NVIDIA to comment that way.

A good way to log your interest is by filing a bug using the instructions linked in a sticky post at the top of the CUDA programming sub-forum. If the development teams need any answers to clarifying questions they will use that vehicle to ask them.

In your bug report you can use the word “enhancement” which will help clarify it, and link back to this forum thread if you like.