The load callback can be used to window data for overlapping DFTs. The trick is to configure cuFFT for non-overlapping DFTs and let the load callback select the correct sample, using the input buffer pointer and the sample offset it receives. For example, to compute 1024-point DFTs over an 8192-point data set with 50% overlap, configure the plan as follows:
int rank = 1; // 1D FFTs
int n[] = { 1024 }; // Size of the Fourier transform
int istride = 1, ostride = 1; // Distance between two successive input/output elements
int idist = 1024, odist = 513; // Distance between batches
int inembed[] = { 0 }; // (ignored for 1D transforms)
int onembed[] = { 0 }; // (ignored for 1D transforms)
cufftHandle handle;
cufftPlanMany( &handle, rank, n, inembed, istride, idist, onembed, ostride, odist, CUFFT_R2C, 15 ); // 15 batches = (8192 - 1024) / 512 + 1
cufftExecR2C() is then called with an input buffer holding 8192 real samples and an output buffer sized for 7695 (15 * 513) complex results.
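The batch count and buffer sizes above follow from simple arithmetic. As a minimal host-side sketch (the function names overlapped_batch_count and r2c_bins are hypothetical helpers, not part of cuFFT), the numbers can be derived like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of frames of length fft_len that fit in
   total_samples when consecutive frames start hop samples apart. */
size_t overlapped_batch_count(size_t total_samples, size_t fft_len, size_t hop)
{
    assert(hop > 0 && fft_len > 0 && total_samples >= fft_len);
    return (total_samples - fft_len) / hop + 1;
}

/* Hypothetical helper: complex bins produced by one real-to-complex
   transform of length fft_len (only the non-redundant half is stored). */
size_t r2c_bins(size_t fft_len)
{
    return fft_len / 2 + 1;
}
```

For the example in this post, overlapped_batch_count(8192, 1024, 512) gives the batch count passed to cufftPlanMany, and multiplying it by r2c_bins(1024) gives the number of complex output elements.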
The magic is in the load callback, which maps non-overlapped DFT samples to overlapped DFT samples. Because the plan is configured for non-overlapping DFTs, the load callback is called for every sample in every DFT, and the offset tells us where cuFFT expects that sample to be in the input buffer. Instead of reading from that location, we split the offset into a batch number (offset / 1024) and an offset into the batch (offset % 1024), and use them to find the overlapped DFT sample:
overlapped_sample = input_buffer[ (batch_number * 512) + batch_offset ]
The batch_offset also selects the windowing factor to apply to that sample.
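To make the mapping easier to check away from the GPU, here is a small host-side sketch of the same index arithmetic. The function name overlapped_index and its parameters are assumptions made for illustration; the callback itself only needs the two integer operations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: translate the non-overlapped offset that cuFFT
   passes to the load callback into an index in the real, overlapped
   input buffer. Optionally reports the position inside the frame, which
   is also the index of the window coefficient. */
size_t overlapped_index(size_t offset, size_t fft_len, size_t hop,
                        size_t *batch_offset)
{
    assert(fft_len > 0);
    size_t batch_number = offset / fft_len;  /* which DFT this sample belongs to */
    size_t within = offset % fft_len;        /* position inside that DFT */
    if (batch_offset != NULL)
        *batch_offset = within;
    return batch_number * hop + within;
}
```

With fft_len = 1024 and hop = 512, the last offset cuFFT can request (15 * 1024 - 1) maps to index 8191, so every read stays inside the 8192-sample input buffer.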
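// Assumption: the window lives in a 1024-entry device array that is filled before the FFT runs.
__device__ cufftReal hanning_window[1024];

__device__ cufftReal
load_callback( void *dataIn, size_t offset, void *callerInfo, void *sharedPtr )
{
    cufftReal *in_buffer = (cufftReal *)dataIn;
    size_t batch_number = offset / 1024;   // which DFT this sample belongs to
    size_t batch_offset = offset % 1024;   // position inside that DFT
    size_t buffer_offset = (batch_number * 512) + batch_offset;   // location in the overlapped input
    return ( in_buffer[buffer_offset] * hanning_window[batch_offset] );
}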
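One way to validate the callback is to build the windowed, overlapped frames explicitly on the host and compare their transforms with the cuFFT output. The sketch below is only a reference under that assumption; build_windowed_frames is a hypothetical name, and the caller must supply an input of at least (batches - 1) * hop + fft_len samples and a frames array of batches * fft_len floats.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference: materialize the frames that the load callback
   produces on the fly, so that
   frames[b * fft_len + i] == in[b * hop + i] * window[i]. */
void build_windowed_frames(const float *in, const float *window,
                           float *frames, size_t batches,
                           size_t fft_len, size_t hop)
{
    assert(in != NULL && window != NULL && frames != NULL && fft_len > 0);
    for (size_t b = 0; b < batches; ++b) {
        for (size_t i = 0; i < fft_len; ++i) {
            /* Frame b starts hop samples after frame b - 1. */
            frames[b * fft_len + i] = in[b * hop + i] * window[i];
        }
    }
}
```

This costs the extra memory and copy that the callback approach avoids, so it is meant for testing rather than production use.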