Batched 1D FFTs (using CUFFT and MEX)


I’m trying to compute 1D FFTs in a batch: the input is a matrix, and each row needs to undergo a 1D transform. The fft2_cuda example that came with the Matlab CUDA plugin was a tremendous help in understanding what needs to be done. The task should be relatively simple, because the built-in 1D FFT already supports batching and fft2_cuda does all the rest.
The fft2_cuda 2D FFT code contains this line:

cufftPlan2d(&plan, N, M, CUFFT_C2C) ;

Naively, I thought it would really be enough to change it into:

cufftPlan1d(&plan, N, CUFFT_C2C, M) ;

to get M 1D FFTs (batch size = M) of N elements each.
This, alas, does not work. (Well, it runs, it just gives the wrong results :( ).

Why does this not work? Where did I go wrong…?
I have attached the full code for reference.
I feel like there’s something very basic I’m missing here to complete this…

Thank you,
Y.

I made some progress… :)
I realized that cufftExecC2C was performing the FFTs column-wise instead of row-wise. So all I had to do was change

cufftPlan1d(&plan, N, CUFFT_C2C, M) ;

to

cufftPlan1d(&plan, M, CUFFT_C2C, N) ;

and to call this function on the transposed matrix.

This is good but not perfect, because the overhead of transposing large matrices is quite significant. So is there any way to tell cufftExecC2C to go row-wise? Do I need to make a change to one of the pack_c2c functions?
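
For anyone following along: the column-wise behaviour is consistent with MATLAB storing matrices in column-major order, while a batched 1D plan simply walks the buffer in consecutive chunks of nx elements. Below is a minimal sketch in plain C (no CUDA needed) of the two index calculations for a matrix with M rows and N columns. The function names are made up for illustration and are not part of CUFFT or fft2_cuda.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: linear offset of element (row, col) in a
   column-major array with nrows rows (MATLAB's memory layout). */
size_t colmajor_offset(size_t row, size_t col, size_t nrows)
{
    assert(row < nrows);
    return row + col * nrows;
}

/* Hypothetical helper: a batched 1D plan of length nx treats the buffer
   as consecutive chunks of nx samples. This is the offset of sample k
   in chunk (batch) b. */
size_t batch_offset(size_t b, size_t k, size_t nx)
{
    assert(k < nx);
    return b * nx + k;
}
```

With nx = M (the number of rows), batch_offset(b, k, M) equals colmajor_offset(k, b, M), so chunk b is exactly column b. With nx = N, as in the first attempt, a chunk no longer lines up with any single row or column, which would explain the wrong results.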


I made a change to the pack_c2c and to the unpack_c2c functions… everything is working, thanks!
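
For readers who want to try the same thing, here is a rough sketch of what such a change could look like. It is not the attached code. It assumes the MEX input arrives as separate real and imaginary double arrays in column-major order, and that the device buffer holds interleaved single-precision pairs. The struct cplx_f is a stand-in for cufftComplex, and the names pack_c2c_rowwise and unpack_c2c_rowwise are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for cufftComplex (interleaved single-precision pair). */
typedef struct { float x; float y; } cplx_f;

/* Interleave re/im and transpose on the fly, so that every matrix row
   becomes one contiguous run of ncols samples in the output buffer. */
void pack_c2c_rowwise(cplx_f *out, const double *re, const double *im,
                      size_t nrows, size_t ncols)
{
    assert(out && re && im);
    for (size_t r = 0; r < nrows; ++r) {
        for (size_t c = 0; c < ncols; ++c) {
            size_t src = r + c * nrows;   /* column-major source */
            size_t dst = r * ncols + c;   /* row-contiguous destination */
            out[dst].x = (float)re[src];
            out[dst].y = (float)im[src];
        }
    }
}

/* Inverse of the above: split the row-contiguous interleaved buffer
   back into column-major re/im arrays. */
void unpack_c2c_rowwise(double *re, double *im, const cplx_f *in,
                        size_t nrows, size_t ncols)
{
    assert(re && im && in);
    for (size_t r = 0; r < nrows; ++r) {
        for (size_t c = 0; c < ncols; ++c) {
            size_t src = r * ncols + c;
            size_t dst = r + c * nrows;
            re[dst] = (double)in[src].x;
            im[dst] = (double)in[src].y;
        }
    }
}
```

With the data packed this way, each of the M rows is a contiguous run of N samples, so a plan of length N with batch size M should line up with the rows, and no separate transpose in MATLAB is needed.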


Moderators: This whole thread can be removed, with my apology. Question was asked and answered (by me!), all in the same afternoon… :)


Why delete? Someone might have this issue in the future and find this via the Search.

Cool, maybe you can publish your code for all users.
What speedups have you achieved?

I have attached my code, in case anyone is ever interested in something like this.
It takes a 2-D matrix and performs a 1D FFT on each row separately, using CUDA’s batch mode. The speed-up is about 4x-5x on my system here (8800 GTX).

Note: I only changed pack_c2c and unpack_c2c, so the input right now has to be complex. I didn’t change pack_r2c, so using a matrix with real values instead of complex values will perform the transform column-wise and not row-wise. I didn’t need it so I didn’t change it for now.

Y.


Is it really possible to decompose an ordinary complex-to-complex 2D FFT into a batch of complex-to-complex 1D FFTs (over the rows)?
Does it give the same result? It sounds strange to me.
That would be very interesting, since a 4x speedup would make CUFFT faster than FFTW (at the moment, FFTW is considerably faster for me on C2C 2D transforms, with array sizes up to 1024x1024).

Thanks for any advice.
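
On the question of whether the results match: as far as the math goes, the 2D DFT is separable, so a batch of row transforms is one half of a 2D transform rather than the whole thing. A short derivation, assuming an M x N input x[m,n] and the usual forward convention with a negative exponent:

```latex
X[k,l] = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x[m,n]\, e^{-2\pi i \left( \frac{km}{M} + \frac{ln}{N} \right)}
       = \sum_{m=0}^{M-1} e^{-2\pi i \frac{km}{M}}
         \left( \sum_{n=0}^{N-1} x[m,n]\, e^{-2\pi i \frac{ln}{N}} \right)
```

The inner sum is a 1D DFT of length N over each of the M rows; the outer sum is a 1D DFT of length M over each of the N columns of that intermediate result. So a row-wise batch followed by a column-wise batch reproduces the 2D transform, while the row-wise batch on its own gives a different (row-by-row) result.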

some common CUDA MATLAB errors and solutions are available here