CUFFT_EXEC_FAILED?

So I’m trying to write a program, part of which involves calculating 16K 128-point FFTs on a bunch of data. Here’s how I’m creating my plan:

  // Setup FFT plan
  cufftResult status = cufftPlan1d(&output_fft, num_channels, CUFFT_R2C, PTS_PER_CHAN);
  if (status != CUFFT_SUCCESS)
    printf("Error creating forward FFT plan!\n");

num_channels = 128

PTS_PER_CHAN = 16K

This executes fine, without any errors. Now here is the function where I’m using it:

// ----------------------------------------------------
// Processes an input buffer of data
void process_input(char* inp_buffer, Complex* out_buffer) {
  // Copy input buffer to device
  cudaMemcpy(dev_inp_buffer, inp_buffer, sizeof(char)*io_buff_size, cudaMemcpyHostToDevice);

  // Run filter on input buffer
  run_filter<<<1, num_legs>>>(dev_inp_buffer, dev_filt_buffer, dev_sig_buffer, dev_out_sf, taps_per_leg, num_legs);

  // Calculate FFT on the output
  cufftResult status = cufftExecR2C(output_fft, (cufftReal*)dev_out_sf, (cufftComplex*)dev_out_cf);

  // Copy output back to host
  cudaMemcpy(out_buffer, dev_out_cf, sizeof(Complex)*io_buff_size, cudaMemcpyDeviceToHost);
}

inp_buffer and out_buffer are both host arrays allocated with cudaMallocHost; every other buffer, prefixed with dev_, was allocated with cudaMalloc. But this execution of the FFT fails with CUFFT_EXEC_FAILED, and I’m at a loss to explain why. I’ve got other code using FFTs that seems to run fine. Thoughts?

Edit:

I should note that I’m on RHEL 4 with no X server running, and I’m using V2.0 of the toolkit.

I tried commenting out the kernel call, and the FFT calls seem to work just fine then. The kernel must be clobbering memory somehow.
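
Since memory clobbering is the suspect, one thing that may be worth ruling out is buffer sizing. A batched real-to-complex transform of length nx reads nx real values per transform but writes only nx/2 + 1 complex values per transform, so the buffers passed to cufftExecR2C and the size of the copy back to the host need to agree with those counts. The helpers below are a minimal host-side sketch of that arithmetic in plain C. The function names are hypothetical and not part of CUFFT, and the byte count assumes a single-precision complex value is two floats, as cufftComplex is.

```c
#include <stddef.h>

/* Hypothetical helpers (not part of CUFFT): element counts for a
 * batched 1D real-to-complex transform of length nx. */

/* Real input elements read by `batch` transforms of length nx. */
size_t r2c_input_elems(size_t nx, size_t batch) {
    return nx * batch;
}

/* Complex output elements written: nx/2 + 1 non-redundant bins
 * per transform, times the number of transforms. */
size_t r2c_output_elems(size_t nx, size_t batch) {
    return (nx / 2 + 1) * batch;
}

/* Output size in bytes, assuming a single-precision complex value
 * is laid out as two floats (as cufftComplex is). */
size_t r2c_output_bytes(size_t nx, size_t batch) {
    return r2c_output_elems(nx, batch) * 2 * sizeof(float);
}
```

For the plan above, and assuming 16K means 16384, r2c_input_elems(128, 16384) is 2097152 floats for dev_out_sf and r2c_output_elems(128, 16384) is 1064960 complex values for dev_out_cf. If the kernel writes past the first count, or either allocation is smaller than these, that could explain a failure that goes away when the kernel is commented out.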