coalesced memory access post batched 1D cufft what is a good strategy?

Hi,

I am doing batched 1D real-to-complex power-of-two FFTs. After the FFT, I multiply the transformed signals together to form correlation products. However, the results of the batched 1D FFTs are arrays of odd size. Since they form one contiguous block of memory (I allocated a single big array, similar to page 12 of the FFT manual) and are of odd size, this appears to violate the alignment requirements for coalesced memory access in subsequent processing. (Except for every 4th array.)

Am I missing something here, or is there no way around this problem because the current FFT library doesn’t support pitched arrays to get the alignment correct? Any suggestions for ways to minimise this problem?

Thanks,
Randall.

Doesn’t “power of two” contradict “odd”? Assuming the power is greater than or equal to four, the coalescing alignment rules are still met… What is the transform-size range for your app?

No, this is real->complex: N samples are transformed into a complex spectrum of length N/2 + 1.

The transform size can be variable, but I am using 128 spectral channels presently. Hence, 256 samples → length 129 spectrum including DC component.

Since it is batched, the start of the second array is at offset 129, so not aligned.
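To make the misalignment concrete, here is a quick sketch of where each batched spectrum starts (offsets in cufftComplex elements, using the 256-point case from above):

```python
n = 256                  # real samples per transform
spec_len = n // 2 + 1    # 129 complex bins out of each R2C transform
# start offset (in cufftComplex elements) of each spectrum in the big array
offsets = [b * spec_len for b in range(5)]
print(offsets)           # [0, 129, 258, 387, 516]
```

Because 129 is odd, consecutive spectra start at offsets that are not multiples of the load granularity the coalescing rules want, so only occasional arrays happen to land on an aligned boundary.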

I wonder if I should just use complex->complex and ignore the redundant half of the array?

Yes, since CUFFT’s real-to-complex paths currently offer no noticeable advantage in computational efficiency over complex-to-complex transforms, sticking to the latter is better. There is a trick for “encoding” two real-to-complex transforms into a single complex-to-complex transform, which you might consider trying. (Unfortunately I have neither a link nor a more or less distinct name for the trick at the moment.)

This trick and variations thereof are described in the Numerical Recipes book…
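For reference, the trick mentioned above works by packing one real sequence into the real part and the other into the imaginary part of a single complex input, then separating the two spectra using conjugate symmetry. A minimal sketch in Python, using a naive O(N²) DFT as a stand-in for the C2C FFT (function names are illustrative, not from any library):

```python
import cmath

def dft(z):
    """Naive O(N^2) DFT of a complex sequence (stand-in for a C2C FFT)."""
    n = len(z)
    return [sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def two_real_ffts(x, y):
    """Transform two real sequences with a single complex transform.

    Pack x into the real part and y into the imaginary part; the
    symmetry between Z(k) and conj(Z(N-k)) then separates the spectra.
    Returns the non-redundant halves (bins 0 .. N/2) of both spectra.
    """
    n = len(x)
    z = dft([complex(a, b) for a, b in zip(x, y)])
    X = [(z[k] + z[-k % n].conjugate()) / 2 for k in range(n // 2 + 1)]
    Y = [(z[k] - z[-k % n].conjugate()) / 2j for k in range(n // 2 + 1)]
    return X, Y
```

With batched transforms this halves the number of C2C FFTs needed, at the cost of the separation pass.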

I did end up going with the complex->complex transform and got the 3X speed improvement that I was looking for. The only disadvantage of this is you waste some memory, but that wasn’t a problem for me.

The other trick I used was to only accumulate 128 channels in the correlation products rather than 129 including the DC term. (Since that term was being discarded later anyway.)

So the general strategy is:

  • nchannels = (say) 128

  • n_inputs = (say) 64, so n_products = 2080

  • allocate sizeof(cufftComplex) * nchannels * 2 * n_batches * n_inputs bytes for the input/output of the FFT

  • allocate sizeof(cufftComplex) * nchannels * n_products bytes for the correlation product(s)

  • load input samples into the buffer, converting to complex

  • do the batched FFT

  • set up the device call with (nchannels + 1) threads and 1 block for each product

in device code:

  • calculate which inputs this correlation product is for

  • loop over batches {

    • all threads load the FFT results for the inputs of this batch, from channel 0 through 128 (hence aligned access)

    • accumulate the correlation product

    }

  • if (thread_index != 0) store the correlation product at result array index (thread_index - 1) (discarding the DC term)
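The accumulation loop above can be sketched in host-side Python (names are illustrative; the real version is a CUDA kernel, with one thread per channel slot and the DC term dropped on store):

```python
def accumulate_products(spectra, n_inputs, nchannels):
    """Accumulate correlation products X_i(k) * conj(X_j(k)) over batches.

    spectra[batch][input] is a list of nchannels + 1 complex bins
    (channels 0 .. nchannels, channel 0 being the DC term).
    Returns products[(i, j)] as nchannels accumulated values with the
    DC term dropped, mirroring the (thread_index - 1) store in the kernel.
    """
    products = {}
    for i in range(n_inputs):
        for j in range(i, n_inputs):      # n_inputs*(n_inputs+1)/2 products
            acc = [0j] * (nchannels + 1)  # one slot per thread, incl. DC
            for batch in spectra:         # loop over batches
                xi, xj = batch[i], batch[j]
                for k in range(nchannels + 1):
                    acc[k] += xi[k] * xj[k].conjugate()
            products[(i, j)] = acc[1:]    # discard the DC term
    return products
```

With n_inputs = 64 this yields 64 * 65 / 2 = 2080 products, matching the figure above.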

(Sorry about the crappy formatting. I cannot get any of these formatting options to actually do anything in the preview.)

This guarantees (for power-of-two FFT sizes) that all reads and writes are aligned and hence coalesced. The DC term is discarded, but that is OK for me. (An alternative is to keep the DC term and discard the Nyquist term, which would be an easy change.)

Cheers,

Randall.

Thanks for the reference!

As far as I understand, you compute the correlation of the spectra by simply multiply-accumulating the frequency values, i.e. working by definition? If so, another thing you might consider is FFT-based correlation, using the fact that the correlation {x(n) * y(n)}(k) is equivalent to the convolution {x(n) * y(-n)}(k) (thus involving data rearrangement and zero-padding, another forward FFT over the already frequency-domain data, pointwise multiplication with 1/N scaling, and an IFFT). Since the signals are of the same length, such an approach is very likely to be a win, reducing the computational complexity of the correlation from O(N²) to O(N log N).
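The suggestion above is the correlation theorem: corr(x, y) = IFFT(X(k) · conj(Y(k))). A minimal sketch of the circular version in Python, again with a naive DFT standing in for the FFT (for linear rather than circular correlation, zero-pad both inputs first):

```python
import cmath

def dft(z, sign=-1):
    """Naive DFT (sign=-1), or inverse DFT without the 1/N factor (sign=+1)."""
    n = len(z)
    return [sum(z[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def circular_correlation(x, y):
    """corr(x, y)[k] = sum_n x(n + k) * conj(y(n)), via the correlation theorem."""
    n = len(x)
    X = dft(x)
    Y = dft(y)
    prod = [X[k] * Y[k].conjugate() for k in range(n)]
    return [v / n for v in dft(prod, sign=+1)]  # inverse FFT incl. 1/N scaling
```

For the batched correlator case, though, note that the products are already being formed in the frequency domain, so the win applies when a time-domain (lag) correlation is actually what is wanted.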