I’m trying to compute a batch of 1D FFTs: the input is a matrix, and each row needs its own 1D transform. The fft2_cuda example that came with the Matlab CUDA plugin was a tremendous help in understanding what needs to be done. The task should be relatively simple, because the built-in 1D FFT already supports batching and fft2_cuda takes care of everything else.
The 2D FFT code in fft2_cuda contains this call:

cufftPlan2d(&plan, N, M, CUFFT_C2C) ;

Naively, I thought it would really be enough to change it into:

cufftPlan1d(&plan, N, CUFFT_C2C, M) ;

My goal was M separate 1D FFTs (batch size = M), each over N elements.
This, alas, does not work. (Well, it runs, it just produces the wrong results :( ).

Why does this not work? Where did I go wrong…?
I have attached the full code for reference.
I feel like there’s something very basic I’m missing here to complete this…
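
One likely explanation, sketched in plain C: Matlab stores matrices in column-major order, and a batched 1D plan treats its input as back-to-back arrays of equal length. Each contiguous chunk of a Matlab buffer is therefore a column (or a mix of columns), never a row. The helper names to_col_major and batch_chunk are made up for this illustration and are not part of fft2_cuda or CUFFT.

```c
#include <assert.h>
#include <stddef.h>

/* Copy a rows x cols matrix written row by row (C order) into
   column-major order, the layout Matlab uses for its arrays. */
static void to_col_major(const double *row_major, size_t rows, size_t cols,
                         double *col_major)
{
    assert(rows > 0 && cols > 0);
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            col_major[c * rows + r] = row_major[r * cols + c];
}

/* Pointer to the k-th back-to-back chunk of `len` elements: this is
   the signal that transform number k of a contiguous batch would read. */
static const double *batch_chunk(const double *buf, size_t len, size_t k)
{
    return buf + k * len;
}
```

For the 2x3 matrix [1 2 3; 4 5 6] the buffer is 1 4 2 5 3 6. Chunks of length 2 are the columns, while chunks of length 3 mix elements from different rows and columns, which would explain results that look valid but are wrong.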

I made some progress… :)
I realized that cufftExecC2C was performing the FFTs column-wise instead of row-wise. So all I had to do was change
cufftPlan1d(&plan, N, CUFFT_C2C, M) ;
into:
cufftPlan1d(&plan, M, CUFFT_C2C, N) ;

and to call this function using:
fft2_cuda(transpose(myMatrix));

This is good but not perfect, because the overhead of transposing large matrices is quite significant. So is there any way to tell cufftExecC2C to go row-wise? Do I need to make a change to one of the pack_c2c functions?

I have attached my code, in case anyone is ever interested in something like this.
It takes a 2D matrix and performs a 1D FFT on each row separately, using CUDA’s batch mode. The speed-up is about 4x to 5x on my system here (8800 GTX).

Note: I only changed pack_c2c and unpack_c2c, so the input right now has to be complex. I didn’t change pack_r2c, so using a matrix with real values instead of complex values will perform the transform column-wise and not row-wise. I didn’t need it so I didn’t change it for now.
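
A rough sketch of what folding the transpose into the pack step can look like, assuming the MEX input arrives as separate real and imaginary double arrays in column-major order. This is not the attached code: complex_f and pack_rows_contiguous are hypothetical names, and the real pack_c2c may have a different signature.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an interleaved single-precision complex value
   (hypothetical; CUFFT's own type is cufftComplex). */
typedef struct { float re, im; } complex_f;

/* Interleave split real/imaginary arrays (column-major, rows x cols)
   and transpose at the same time, so that each original row becomes
   one contiguous run of `cols` elements in `out`.
   im may be NULL for purely real input. */
static void pack_rows_contiguous(complex_f *out, const double *re,
                                 const double *im, size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    for (size_t c = 0; c < cols; ++c) {
        for (size_t r = 0; r < rows; ++r) {
            size_t src = c * rows + r;   /* column-major source index */
            size_t dst = r * cols + c;   /* row r is contiguous in out */
            out[dst].re = (float)re[src];
            out[dst].im = im ? (float)im[src] : 0.0f;
        }
    }
}
```

After such a pack, a contiguous batch of `rows` transforms of length `cols` would read one original row per transform; the matching unpack step has to apply the inverse index mapping.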

Is it really possible to decompose an ordinary complex-to-complex 2D FFT into a batch of complex-to-complex 1D FFTs (over the rows)?
Does it give the same result? That sounds strange to me.
It would be very interesting, since a 4x speedup would make CudaFFT faster than FFTW (at the moment, I get considerably faster C2C 2D transforms from FFTW, for array sizes up to 1024x1024).