coalesced memory access post batched 1D cufft what is a good strategy?


I am doing batched 1D real2complex power-of-two FFTs. After the FFT, I am multiplying the FFTd signals together to form correlation products. However, the results of the batched 1D ffts are arrays of odd size. Since they form a contiguous block of memory (I have allocated one big array similar to page 12 of the FFT manual) and are odd size, this appears to violate the alignment requirements for coalesced memory access for subsequent processing. (Except for every 4th array.)

Am I missing something here or is there no way around this problem because the current FFT library doesn’t support arrays with pitches to get the alignment correct? Any suggestions for ways to minimise this problem?


Doesn’t “power of two” contradict “odd”? Assuming the power is greater or equal than four, the coalescing alignment rules are still met… What is the transform size range for your app?

no, this is real->complex. N samples get transformed into an N/2 + 1 length complex spectrum.

The transform size can be variable, but I am using 128 spectral channels presently. Hence, 256 samples -> length 129 spectrum including DC component.

Since it is batched, the start of the second array is at offset 129, so not aligned.

I wonder if I should just use complex->complex and ignore the redundant half of the array?

Yes, since currently CUFFT’s real-to-complex paths do not offer noticeable advantage in terms of computational efficiency over compex-to-complex transforms, sticking to the latter is better. There exists a trick with “encoding” two real-to-complex transforms into single complex-to-complex transform, which you might consider trying. (unfortunately don’t have neither link nor more or less distinct name for the trick atm)

This trick and variations thereof are described in the Numerical Recipes book…

I did end up going with the complex->complex transform and got the 3X speed improvement that I was looking for. The only disadvantage of this is you waste some memory, but that wasn’t a problem for me.

The other trick I used was to only accumulate 128 channels in the correlation products rather than 129 including the DC term. (Since that term was being discarded later anyway.)

So the general strategy is:


  • nchannels = (say) 128

  • n_inputs = (say) 64, so n_products = 2080

  • allocate mem for sizeof(cufftComplex)nchannels2n_batchesn_inputs for input/output of FFT

  • allocate mem for sizeof(cufftComplex)nchannelsn_products for correlation product(s)

  • load input samples into buffer, converting to complex

  • do the batched FFT

  • set up the device call with (nchannels+1) threads and 1 block for each product

in device code:

  • calculate which inputs this correlation product is for

  • loop over batches {

    • all threads load fft results for inputs of batch from channel 0 through 128 (hence aligned access)

    • accumulate correlation product


  • if (thread_index !=0) store correlation product in result array index (thread_index -1) (discard DC term)


(Sorry about the crappy formatting. I cannot get any of these formatting options to actually do anything in the preview.)

This guarantees (for power-of-two FFT sizes) that all reads and writes are aligned and hence coalesced. The DC term is discarded but that is OK for me. (An alternative is to keep the DC term and discard the nyquist term, which would be an easy change.)



Thanks for the reference!

As far as I understand you calculate correlation of the spectres and just multiply-accumulate the frequency values acting by definition? If so, another thing you might consider is FFT-based correlation, taking into account that {x(n) * y(n)}(k) correlation is equivalent to {x(n) * y(-n)}(k) convolution (thus involving data rearrangement and zero-padding, another direct FFT over already frequency-domain data, pointwise multiplication and 1 / N scaling, IFFT). Since the signals are of the same length, such approach is very likely to be a win, reducing the computational complexity of correlation from O(N ** 2) to O(N * logN)