Strange behaviour of CUFFT plan

I have implemented an expression-template-based library that works with both CPU and GPU arrays, and I now want to add FFT functionality on the GPU using CUFFT.

I have this code fragment:

int len=10;

CudaMatrix<float> A_D(1,len);

A_D = ones<float>(1,len);

// Option 1
cufftHandle plan = DEVICE_FFT_PLAN_C2C(A_D.GetNumElements(),1);

// Option 2
cufftHandle plan;
if (cufftPlan1d(&plan, len, CUFFT_C2C, 1) != CUFFT_SUCCESS){
    fprintf(stderr, "CUFFT error: Plan creation failed"); getch();
    return 0;
}

// Option 3
cufftHandle plan;

// Option 4
DEVICE_FFT_PLAN_C2C(A_D.GetNumElements(),1);

// Option 5
// No code concerning options 1, 2, 3 or 4

The code snippets under options 1 to 4 are mutually exclusive (only one of them is present at a time). Options 1, 2 and 4 create a CUFFT plan, while option 3 only declares a plan handle. Option 5 means that none of the instructions under options 1 to 4 is present.
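For reference, a plan-only test that does not depend on the library could look like the sketch below. It is a minimal sketch, assuming only the public CUFFT calls cufftPlan1d and cufftDestroy; the function name plan_roundtrip is hypothetical, and it is not the DEVICE_FFT_PLAN_C2C routine from the fragment above, whose implementation is not shown here.

```cuda
#include <cstdio>
#include <cufft.h>

// Hypothetical helper, not the DEVICE_FFT_PLAN_C2C from the post:
// creates a 1D complex-to-complex plan, then releases it.
// Returns 0 on success and 1 on failure.
int plan_roundtrip(int len)
{
    cufftHandle plan;
    cufftResult status = cufftPlan1d(&plan, len, CUFFT_C2C, 1);
    if (status != CUFFT_SUCCESS) {
        fprintf(stderr, "CUFFT error: plan creation failed (code %d)\n", (int)status);
        return 1;
    }
    // The plan would be passed to cufftExecC2C here.
    status = cufftDestroy(plan);
    return status == CUFFT_SUCCESS ? 0 : 1;
}
```

Calling plan_roundtrip(10) from main in a separate project linked against CUFFT would show whether plan creation misbehaves on its own, without any of the library code involved.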

Options 1, 2 or 3
When I use the code snippets under options 1, 2 or 3, the compiled code crashes after I have run it once or twice. Specifically, it crashes at the ones instruction (which internally launches a kernel) with an unknown error.

Options 4 or 5
In this case, the code does not crash and returns correct results.

It seems that merely declaring a plan variable (options 1, 2 and 3), as opposed to creating a plan without storing its handle (option 4), makes the code crash, although not after a deterministic number of launches. Also, declaring or using the plan leads to an “anti-causal” error in an earlier instruction (?).
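To pin down which call actually fails, it may help to check for errors right after the kernel launch and again after a synchronization. Kernel launches are asynchronous with respect to the host, so an error raised while a kernel runs is only reported by a later call. The sketch below is an assumption-laden illustration: fill_ones is a hypothetical stand-in for the kernel behind ones<float>(), which is not shown here, and check and fill_ones_checked are hypothetical helper names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel behind ones<float>();
// the real kernel is not shown in the post.
__global__ void fill_ones(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 1.0f;
}

// Prints a message and returns false if a CUDA runtime call failed.
static bool check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Fills a device array with ones and checks for errors after each step,
// so that a failure is attributed to the step that reported it.
bool fill_ones_checked(int len)
{
    float* d_data = NULL;
    if (!check(cudaMalloc((void**)&d_data, len * sizeof(float)), "cudaMalloc"))
        return false;

    const int threads = 256;
    const int blocks = (len + threads - 1) / threads;
    fill_ones<<<blocks, threads>>>(d_data, len);

    // Errors detected at launch (for example an invalid configuration)
    // are reported here.
    bool ok = check(cudaGetLastError(), "kernel launch");
    // Errors raised while the kernel runs are reported only after
    // the host waits for the device.
    ok = check(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(d_data);
    return ok;
}
```

Calling fill_ones_checked(10) in place of the ones instruction, with and without the plan code present, would indicate whether the failure is reported at launch time or during execution.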

I’m using Visual Studio 2010 and CUDA 5.0.

Can anyone help with this “obscure” phenomenon?

Thanks.

Some more info

I have done three further tests:

  1. Previously, I was compiling in "Release" mode. If I compile in "Debug" mode, there is no crash.
  2. If I run the code under the CUDA debugger (Nsight) with the CUDA memory checker enabled, even without breakpoints, the program never crashes.
  3. If I run the code through cuda-memcheck, there is no crash and cuda-memcheck never reports an error.

A clarification:

There is no relation between A_D and the creation of the plan. I do not intend to compute the transform of A_D. The same behaviour appears even if A_D is a float2 array.