cuFFT library: question on cufftExecC2C() behavior


I am using the cuFFT library as shown in the following skeletal code example:

int mem_size = signal_size * sizeof(cufftComplex);
cufftComplex * h_signal = (cufftComplex*)malloc(mem_size);
/** fill up h_signal with some random values… **/

cufftComplex * d_signal = NULL;
cudaError_t status = cudaMalloc((void**)&d_signal, mem_size);
if (status != cudaSuccess) { /* get CUDA error string… */ }

status = cudaMemcpy(d_signal, h_signal, mem_size, cudaMemcpyHostToDevice);
if (status != cudaSuccess) { /* get CUDA error string… */ }

cufftHandle plan;
cufftResult_t fft_status = cufftPlanMany( &plan, 1, &signal_size,
                                          NULL, 1, 0, NULL, 1, 0,
                                          CUFFT_C2C, batches);
if (fft_status != CUFFT_SUCCESS) { /* get FFT error string… */ }

// create and start timer
cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start, 0 );

// Transform signal and kernel
fft_status = cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_FORWARD);
if (fft_status != CUFFT_SUCCESS) { /* get FFT error string… */ }

cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
float elapsed_time_msec;
cudaEventElapsedTime( &elapsed_time_msec, start, stop );

For my experiment, I am using a 512-element FFT (signal_size in the code above) and varying the number of batches from 1 to 1024 in powers of 2. I get valid time measurements across the cufftExecC2C call up to 256 batches. When batches is set to 512, the elapsed time becomes zero, but I don’t get any error. When batches is set to 1024, I get the error CUFFT_EXEC_FAILED, which is probably due to a resource limitation.
My question is: what is happening when batches is set to 512? There is no error, but the time is zero, as if the kernel never launched from within the FFT library. Checking cudaGetLastError() also does not show any error.
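
One thing worth ruling out is the buffer size. As far as I understand, a batched plan with the default contiguous layout (NULL inembed/onembed) processes signal_size * batches elements per call, so the device buffer would need to be at least that large. Below is a minimal host-only sketch of that arithmetic. Complex2f is a stand-in for cufftComplex (two packed floats) so that it compiles without the CUDA headers, and batched_c2c_bytes is a hypothetical helper name, not part of cuFFT.

```cpp
#include <cstddef>

// Stand-in for cufftComplex (assumption: two packed 32-bit floats),
// used here so the sketch builds without the CUDA toolkit.
struct Complex2f {
    float x;  // real part
    float y;  // imaginary part
};

// Hypothetical helper: number of bytes a contiguous, batched 1D C2C
// transform reads and writes, i.e. signal_size * batches elements.
std::size_t batched_c2c_bytes(int signal_size, int batches) {
    return static_cast<std::size_t>(signal_size) *
           static_cast<std::size_t>(batches) *
           sizeof(Complex2f);
}
```

For example, batched_c2c_bytes(512, 512) gives the size I would compare against the mem_size passed to cudaMalloc and cudaMemcpy for the 512-batch case.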

I am running this code on a GTX 460 card. The code is compiled in Visual Studio with CUDA 3.2 (Windows 7).

Has anyone else seen this issue, or can you suggest any way to debug it?

Another thing: I am using a 1D FFT. Should I be using cufftPlan1d() instead? I saw a comment in the header file saying that the use of ‘batches’ in cufftPlan1d() is deprecated and suggesting cufftPlanMany() instead.

Thanks in advance for any suggestion,