How to call "cufft_c2c_radix2"

I am calling the “cufft_c2c_radix2” kernel with the following parameters, and it produces incorrect results.

Let's say I would like to compute the FFT of the following signal:

BLOCK_SIZE = 8;

signalSize = 16; // in cufftComplex elements
theta = 2 * PI / signalSize;
base = 3;

strd.ibStride = BLOCK_SIZE; // Is this correct?
strd.ieStride = 1; // Is this correct?
strd.obStride = BLOCK_SIZE;
strd.oeStride = 1;

dim3 dimBlock(BLOCK_SIZE, 1);
dim3 dimGrid(signalSize / dimBlock.x, 1);

smemSize = sizeof(cufftComplex) * signalSize;

cufft_c2c_radix2<<<dimGrid, dimBlock, smemSize>>>(
    smemSize,      // Signal size in complex elements
    theta,         // 2 * Pi / N
    base,          // log base 2 of N
    d_inImg,       // Pointer to input signal in global memory
    d_outImg,      // Pointer to output array in global memory
    CUFFT_FORWARD, // FFT direction: -1 is forward, 1 is inverse
    strd);         // Input and output block and element strides
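For comparison, here is a minimal host-side sketch of the values that the parameter comments above appear to describe, derived from the signal size alone. It assumes a power-of-two size, radix R = 2, and that one complex element is two floats (8 bytes). `deriveParams`, `log2OfSize`, `isPowerOfTwo` and `Radix2Params` are hypothetical names used only for illustration and are not part of cuFFT. Whether the kernel really expects these values is an assumption based only on the comments.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of cuFFT: true when n is a power of two.
bool isPowerOfTwo(int n) { return n > 0 && (n & (n - 1)) == 0; }

// Hypothetical helper: integer log base 2 of a power-of-two n.
int log2OfSize(int n) {
    assert(isPowerOfTwo(n));
    int bits = 0;
    while ((1 << bits) < n) ++bits;
    return bits;
}

// Values derived from the signal size, following the parameter comments.
struct Radix2Params {
    int n;                  // signal size in complex elements
    float theta;            // 2 * pi / n
    int log2n;              // log base 2 of n
    int threadsPerBlock;    // n / R, assuming R = 2
    std::size_t smemBytes;  // n complex values, assumed 2 floats each
};

Radix2Params deriveParams(int n) {
    const float kPi = 3.14159265358979f;
    Radix2Params p;
    p.n = n;
    p.theta = 2.0f * kPi / static_cast<float>(n);
    p.log2n = log2OfSize(n);
    p.threadsPerBlock = n / 2;
    p.smemBytes = static_cast<std::size_t>(n) * 2 * sizeof(float);
    return p;
}
```

For signalSize = 16 the sketch gives log2n = 4, threadsPerBlock = 8 and smemBytes = 128. The call above instead passes base = 3, and passes smemSize (a byte count) where the comment asks for the size in complex elements.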

Can someone explain why this setup produces an incorrect result?

In addition, a comment in the “cufft_kernels.h” file says that to “perform M transforms of size N, set grid.y = N, and thread.x = N / R.”
I expected grid.y to hold the number of signals (M) in a batch. Am I missing something?
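To make the two readings concrete, the sketch below computes the launch dimensions each one implies. `LaunchShape`, `shapeAsQuoted` and `shapeAsBatched` are hypothetical names, not CUDA or cuFFT types, and neither reading is confirmed to be what the kernel expects.

```cpp
#include <cassert>

// Plain stand-in for the launch dimensions; hypothetical, not a CUDA type.
struct LaunchShape {
    int gridY;     // value that would go into grid.y
    int threadsX;  // value that would go into thread.x
};

// Reading 1: the header comment taken literally
// (grid.y = N, thread.x = N / R).
LaunchShape shapeAsQuoted(int n, int radix) {
    assert(radix > 0 && n % radix == 0);
    return LaunchShape{n, n / radix};
}

// Reading 2: one grid row per signal in the batch
// (grid.y = M, thread.x = N / R).
LaunchShape shapeAsBatched(int m, int n, int radix) {
    assert(m > 0 && radix > 0 && n % radix == 0);
    return LaunchShape{m, n / radix};
}
```

For a single transform (M = 1) of size N = 16 with R = 2, both readings give thread.x = 8, but grid.y is 16 under the first and 1 under the second. The launch in this post uses dimGrid(2, 1) and dimBlock(8, 1), so grid.y = 1.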

Thank you!