CUFFT Batch Behavior for nfft > data length

Hi y’all.

I’ve got a situation where I need to take 3 FFTs whose size is greater than the length of the data. To be more explicit: I constructed a 1-D array consisting of a concatenation of 3 rows, each of length 2192. On each of these 2192-element “rows”, I’d like to take a 32768-point FFT (the result being a 3 * 32768-point data array). Would the batch parameter take care of this when constructing the CUFFT 1D plan?

Basically, I’m wondering whether batch “divides” the input data into batch segments and takes an n-point FFT of each segment, or whether it works in some other manner.

I’ve tried this out on my Tesla (batch = 3, nfft = 32768), but I’m seeing unexpected results.
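
Roughly what I tried looks like this (from memory, so not my exact code; d_data is the packed 3 * 2192 input and d_out is a 3 * 32768 output buffer):

```cpp
// Sketch of the attempt: batch = 3, nfft = 32768, but the input is only 3 * 2192 points long
cufftHandle plan;
cufftPlan1d(&plan, 32768, CUFFT_C2C, 3);            // 3 FFTs of 32768 points each
cufftExecC2C(plan, d_data, d_out, CUFFT_FORWARD);   // d_data holds just 3 * 2192 points --
                                                    // how does the batched plan read it?
cufftDestroy(plan);
```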

Thanks

Hi,
I’m still wondering what batch actually does…
My solution would be to set up a plan with batch = 1 and the right FFT size, and then load the fragments from your input padded with trailing zeros (or whatever you need in order to make each fragment as long as the window).
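
Something like this is what I have in mind (just a sketch, with made-up names and no error checking):

```cpp
// Batch-1 plan + explicit zero padding of each fragment
#include <cuda_runtime.h>
#include <cufft.h>

void fftPaddedRows(const cufftComplex* d_rows,  // nRows rows of rowLen samples, packed contiguously
                   cufftComplex* d_out,         // nRows * nfft output slots
                   int rowLen, int nfft, int nRows)
{
    cufftHandle plan;
    cufftPlan1d(&plan, nfft, CUFFT_C2C, 1);     // batch = 1, FFT size = nfft

    cufftComplex* d_padded;
    cudaMalloc(&d_padded, nfft * sizeof(cufftComplex));

    for (int r = 0; r < nRows; ++r) {
        cudaMemset(d_padded, 0, nfft * sizeof(cufftComplex));      // trailing zeros
        cudaMemcpy(d_padded, d_rows + r * rowLen,
                   rowLen * sizeof(cufftComplex), cudaMemcpyDeviceToDevice);
        cufftExecC2C(plan, d_padded, d_out + r * nfft, CUFFT_FORWARD);
    }

    cudaFree(d_padded);
    cufftDestroy(plan);
}
```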

I’m glad I’m not alone on the intricacies of batch! Anyway, thanks for the suggestion, Murivan. I’m sure that’ll work, but like you, I’d like to know how batch behaves when the data length is not an integer multiple of the FFT size.

The way I understand (and use) the batch feature is the following:

You have N same-size FFTs to do at the same time. Instead of calling the FFT routine N times, you can call it once with a batch size of N. In some cases the FFT library can then parallelize these FFTs, since they are independent.
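
In code, I mean something like this (just a sketch, assuming the N inputs of length n sit contiguously in device memory):

```cpp
// N separate calls, one FFT of size n each
cufftHandle plan;
cufftPlan1d(&plan, n, CUFFT_C2C, 1);
for (int i = 0; i < N; ++i)
    cufftExecC2C(plan, d_in + i * n, d_out + i * n, CUFFT_FORWARD);
cufftDestroy(plan);

// The same N transforms as one batched call over the N * n contiguous points
cufftPlan1d(&plan, n, CUFFT_C2C, N);
cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);
cufftDestroy(plan);
```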

I’m not sure I understand your requirements, so I can’t give you any advice on what to do in your case. Do you have 3 rows or 32768? Maybe the stride parameters help in your case. Not sure, though.

Seb, thanks for the response. I’ve solved my problem by zero-padding to the length of the FFT I’d like to take (as Murivan suggested) and then using the batch parameter. To restate my original question, though: I was wondering whether the zero padding would be taken care of for me by the CUFFT plan I used. Simply put, my input data was 3 rows of 2192 complex elements each (in a 1-D array, as usual), and I wanted to perform a 32768-point FFT on each row. I was wondering whether simply asking for 3 instances of the FFT, on a data length of 2192 * 3, with an FFT size of 32768, would yield the correct results.
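
For the record, what I ended up doing looks roughly like this (a sketch with made-up names; d_rows is the packed 3 * 2192 input):

```cpp
const int rowLen = 2192, nfft = 32768, batch = 3;

// One big buffer with a 32768-sample slot per row, zeroed for the padding
cufftComplex* d_padded;
cudaMalloc(&d_padded, batch * nfft * sizeof(cufftComplex));
cudaMemset(d_padded, 0, batch * nfft * sizeof(cufftComplex));

// Copy each 2192-sample row to the front of its slot
for (int r = 0; r < batch; ++r)
    cudaMemcpy(d_padded + r * nfft, d_rows + r * rowLen,
               rowLen * sizeof(cufftComplex), cudaMemcpyDeviceToDevice);

// One batched plan: a 32768-point FFT per slot, executed in place
cufftHandle plan;
cufftPlan1d(&plan, nfft, CUFFT_C2C, batch);
cufftExecC2C(plan, d_padded, d_padded, CUFFT_FORWARD);   // result: 3 * 32768 points
cufftDestroy(plan);
```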

In other words, I was asking how batch really works. Does it take the first nfft points, compute the FFT of those, then move on to the next nfft points (repeating this process batch times)? Or does it split the data into batch pieces and perform an nfft-point FFT on each segment?

At this point, I’m just curious; like I mentioned before, I solved my problem with Murivan’s suggestion.

GPUconv.cu (11.7 KB)

Good for you that you solved your problem :)
If someone can clarify this point I’m still interested. I’m also concerned about performance, so the best-parallelized choice matters to me.
My case is a bit atypical. I have 2 input sequences of a priori unknown length. They are loaded at runtime, but typical values are 1M samples for the first sequence (the input) and 90K for the second (the filter). I need to perform circular convolution, which means I have to transform the filter in a single window and choose an appropriate “payload” size for the input. For comparison with another approach, I chose the payload to be the same as the filter length, so I have windows of about 180K samples (so the circular convolution works out). That gives 1 call for the filter (90K points + 90K trailing zeros) and then N calls for the input signal (90K points from the input + 90K trailing zeros each). For each fragment I compute the product with a modified version of the ComplexPointwiseMul kernel found in simpleCUFFT (basically I don’t use the scaling factor, which is a mess!). As you can see, each fragment is independent, so I’m wondering what the best way is to parallelize this task and get better performance.
Any help appreciated.
I attach a code snippet (this part is under the branch called “overlap and save”); you can see I have another branch where I use a single window of length filter + input (which can be HUGE and so cannot fit in memory), in order to make comparisons with a CPU approach I used some years ago.
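
For reference, here is a simplified sketch of that “overlap and save” loop (this is not the attached code; the names are made up, and error checking and the overlap bookkeeping are omitted):

```cpp
// Pointwise complex multiply without the scaling factor
__global__ void ComplexPointwiseMul(cufftComplex* a, const cufftComplex* b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex c;
        c.x = a[i].x * b[i].x - a[i].y * b[i].y;
        c.y = a[i].x * b[i].y + a[i].y * b[i].x;
        a[i] = c;
    }
}

// The filter is transformed once; each padded fragment is transformed, multiplied, inverted
void convolveFragments(const cufftComplex* d_filterFFT,  // FFT of the zero-padded filter, length win
                       cufftComplex* d_fragments,        // nFragments windows, each already padded to win
                       int win, int nFragments)
{
    cufftHandle plan;
    cufftPlan1d(&plan, win, CUFFT_C2C, 1);               // one window at a time
    dim3 block(256), grid((win + block.x - 1) / block.x);

    for (int f = 0; f < nFragments; ++f) {               // fragments are independent of each other
        cufftComplex* frag = d_fragments + f * win;
        cufftExecC2C(plan, frag, frag, CUFFT_FORWARD);
        ComplexPointwiseMul<<<grid, block>>>(frag, d_filterFFT, win);
        cufftExecC2C(plan, frag, frag, CUFFT_INVERSE);   // note: cuFFT leaves this unscaled
    }
    cufftDestroy(plan);
}
```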

Bests.