cuFFT gives strange results

I have run into trouble using the cufftPlanMany function to compute a 2D FFT. I know there is a function that does this in a simpler way, but I want to use cufftPlanMany for batched execution.

I am testing the function with a signal of 4x4 points (four rows and four columns) and with batch values 1, 2, 4 and 8. When the batch value is not 1, I copy the first signal into the others. So, if batch equals 4, then s_0 = s_1 = s_2 = s_3, where s_i is the signal at position i.

I store the values in row-major and plane-major order, i.e. elements of the same row are consecutive in memory, and rows of the same signal are consecutive in memory. For example, signal[z][y][x] means the element of signal number z in the batch (the signal in plane z, the outermost dimension), in row y and column x (the x axis is the innermost dimension).
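A minimal host-side sketch of the layout described above, assuming double-precision real input and SIZE = 4. The helper names signal_index and replicate_first_signal are hypothetical and only illustrate the indexing; they are not taken from the original code.

```c
#include <stddef.h>
#include <string.h>

enum { SIZE = 4 };  /* points per dimension, as in the post */

/* Linear offset of signal[z][y][x]: z is the batch (plane) index,
   y the row, x the column (innermost, contiguous in memory). */
static size_t signal_index(size_t z, size_t y, size_t x)
{
    return (z * SIZE + y) * SIZE + x;
}

/* Copy signal 0 into signals 1..num_batch-1 so that all are identical. */
static void replicate_first_signal(double *signal, size_t num_batch)
{
    const size_t plane = (size_t)SIZE * SIZE;
    for (size_t z = 1; z < num_batch; ++z)
        memcpy(signal + z * plane, signal, plane * sizeof(double));
}
```

With this indexing, consecutive signals start 16 elements apart, which matches the idist value of 16 used for the input.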

I am using a real-to-complex transform, so if the original signal is original_signal[NUM_BATCH], the forward FFT signal is forward_signal[NUM_BATCH] and the backward FFT is [NUM_BATCH].

What happens is that the forward FFT is different for each input signal (remember that when there is more than one signal, they are all equal), yet when I do the backward FFT, all of the results are equal to the original signal.

Below are some code fragments showing how I call the functions (the variable declarations follow).

check_return_value( cufftPlanMany( &fftHandle, 2, n,
                                   inembed, istride, idist,
                                   onembed, ostride, odist,
                                   CUFFT_D2Z, batch ) );

check_return_value( cufftExecD2Z( fftHandle, 
                                  (cufftDoubleReal*) d_idata,
                                  (cufftDoubleComplex*) d_odata ));

d_idata is declared as

cufftDoubleReal* d_idata;

and it is allocated with

cudaMalloc( (void**) &d_idata, NUM_FFT * SIZE * SIZE * sizeof( cufftDoubleReal ) );

d_odata is declared as

cufftDoubleComplex* d_odata;

and allocated with

cudaMalloc( (void**) &d_odata, NUM_FFT * (SIZE) * (SIZE/2 + 1) * sizeof( cufftDoubleComplex ) );

The variable NUM_FFT is equal to batch (8), and SIZE is the number of points in each dimension (4).
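To make the two allocation sizes concrete, here is a small sketch of the element counts they imply. The function names are hypothetical; the arithmetic simply mirrors the two cudaMalloc expressions above.

```c
#include <stddef.h>

/* Number of real input elements for `batch` 2D signals of size x size. */
static size_t r2c_input_elems(size_t batch, size_t size)
{
    return batch * size * size;
}

/* Number of complex output elements when the output is tightly packed:
   the innermost dimension shrinks to size/2 + 1 for a real-to-complex
   transform. */
static size_t r2c_output_elems(size_t batch, size_t size)
{
    return batch * size * (size / 2 + 1);
}
```

For NUM_FFT = 8 and SIZE = 4 this gives 128 real elements for d_idata and 96 complex elements for d_odata, i.e. 12 complex elements per signal.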

I have a function that prints the values passed as parameters just before the calls. This is what it prints:
n[0] : 4
n[1] : 4
inembed[0] : 4
inembed[1] : 4
istride : 1
idist : 16
onembed[0] : 4
onembed[1] : 4
ostride : 1
odist : 16
batch : 8
The function calls return no errors.
I am using a GTX258 card on Ubuntu 11.10 with the latest driver (304.54). The SDK version is 5.0.
I don’t know what the problem is. Can anyone help me?

Thanks in advance,

I have realized what happens. For an R2C transform, the input and output vectors are supposed to have sizes SIZE and (SIZE / 2 + 1) respectively. So I reasoned as follows.
If the input data is x_00 x_01 x_02 x_03 x_10 x_11 x_12 x_13 x_20 x_21 x_22 x_23 x_30 x_31 x_32 x_33 (where x_ij means the element in the i-th row and j-th column),
I expected the output to be x_00 x_01 x_02 x_10 x_11 x_12 x_20 x_21 x_22 x_30 x_31 x_32 (note that element x_i3 is missing and all elements have shifted left in memory). Instead, what I get is
x_00 x_01 x_02 nan x_10 x_11 x_12 nan x_20 x_21 x_22 nan x_30 x_31 x_32 nan
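One possible reading of this pattern: the cuFFT documentation describes the advanced data layout of a 2D transform as output[b * odist + (x * onembed[1] + y) * ostride]. Assuming that formula applies here, the host-only sketch below computes where each output element would land. The helper names are hypothetical and no cuFFT call is made.

```c
#include <stddef.h>

/* Offset of output element (row, col) of batch b under the advanced
   layout formula: b * odist + (row * onembed1 + col) * ostride. */
static size_t out_offset(size_t b, size_t row, size_t col,
                         size_t onembed1, size_t ostride, size_t odist)
{
    return b * odist + (row * onembed1 + col) * ostride;
}

/* Largest offset written for `batch` transforms of n0 x n1 real points;
   a real-to-complex transform produces n1/2 + 1 complex values per row. */
static size_t last_out_offset(size_t batch, size_t n0, size_t n1,
                              size_t onembed1, size_t ostride, size_t odist)
{
    return out_offset(batch - 1, n0 - 1, n1 / 2, onembed1, ostride, odist);
}
```

With onembed[1] = 4 and odist = 16, as in the parameter list printed earlier, row r of the first signal would occupy offsets 4r, 4r+1 and 4r+2, and offset 4r+3 would never be written, which is consistent with the nan entries. Under the same assumption the last offset written for batch 8 would be 126, while d_odata holds only 8 * 4 * 3 = 96 elements. With onembed[1] = 3 and odist = 12 (values that are an assumption here, not something tried in this post) the last offset would be 95.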

I haven’t read anything about that in the user’s guide. Does anyone know why it happens?
Thanks in advance,