cuFFT wrapper for FFTW segfaults


I am converting a previously working FFTW stack, running on a TX2, to cuFFT.

I got to the point where everything links and compiles without error, but it segfaults at runtime.

When using the wrapper, do we still need to use cudaMalloc to allocate memory? Or is it intelligent enough to work with ordinary host memory, since the Tegra shares RAM between the CPU and GPU?

My current procedure in C++:

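// Complex is assumed to be layout-compatible with fftwf_complex (two floats)
float* input = new float[INPUT_SIZE]();
Complex* output = new Complex[INPUT_SIZE/2 + 1]();

// n is the logical (real) transform length; the output holds n/2 + 1 complex bins
fft_planRange = fftwf_plan_dft_r2c_1d(INPUT_SIZE, input, reinterpret_cast<fftwf_complex*>(output), FFTW_PATIENT);

... fill buffers 

// pass the buffer pointer itself, not the address of the pointer variable
fftwf_execute_dft_r2c(fft_planRange, input, reinterpret_cast<fftwf_complex*>(output));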
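
For reference on buffer sizing: a real-to-complex transform of N real samples produces N/2 + 1 complex bins, and in the FFTW interface the first argument of fftwf_plan_dft_r2c_1d is that logical length N. The sketch below is a naive DFT in plain C++ (not FFTW or cuFFT; naive_r2c is a hypothetical name used only here) that illustrates the layout, so the sizes can be sanity-checked without any GPU involved.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) real-to-complex DFT, for illustration only (not FFTW or cuFFT).
// For N real samples it returns the N/2 + 1 non-redundant bins, which is the
// output layout of an r2c transform of logical size N.
std::vector<std::complex<float>> naive_r2c(const std::vector<float>& in) {
    assert(!in.empty());
    const std::size_t n = in.size();
    const double two_pi = 6.283185307179586;
    std::vector<std::complex<float>> out(n / 2 + 1);
    for (std::size_t k = 0; k < out.size(); ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            // Forward transform convention: exp(-2*pi*i*k*t/N)
            const double angle =
                -two_pi * static_cast<double>(k * t) / static_cast<double>(n);
            acc += static_cast<double>(in[t]) *
                   std::complex<double>(std::cos(angle), std::sin(angle));
        }
        out[k] = std::complex<float>(static_cast<float>(acc.real()),
                                     static_cast<float>(acc.imag()));
    }
    return out;
}
```

With 8 input samples this returns 5 bins, so an output buffer of INPUT_SIZE/2 + 1 elements matches a plan created with length INPUT_SIZE.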

Follow-up: the segfault is resolved; it was due to a bounds problem outside the FFT loop.

However, FFT performance through the wrapper seems slow compared to FFTW running on the CPU cores.

Is there any clarification on how memory is used for the GPU calls? Since I am not yet using cudaMalloc, is the program doing extra memory copies at execution time because of the wrapper?
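
To make the comparison concrete, it may help to time only the execute call, averaged over many runs, with plan creation and the first few calls left out. Below is a minimal sketch of such a harness in plain C++. The helper name mean_exec_us and its parameters are hypothetical, not part of FFTW or cuFFT. If the wrapper does copy host buffers on every execute call (an assumption, not something confirmed here), that cost would show up in this per-call figure.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: mean wall-clock time per call of `run`, in microseconds.
// The first `warmup` calls are executed but excluded from the measurement,
// so one-time setup costs do not skew the average.
template <typename F>
double mean_exec_us(F&& run, std::size_t warmup, std::size_t iters) {
    assert(iters > 0);
    for (std::size_t i = 0; i < warmup; ++i) {
        run();
    }
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) {
        run();
    }
    const auto t1 = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> total = t1 - t0;
    return total.count() / static_cast<double>(iters);
}
```

In this code, the callable would wrap just the fftwf_execute_dft_r2c call on the existing plan and buffers, with fftwf_plan_dft_r2c_1d kept outside it, and the same harness would be run once against FFTW and once against the wrapper build.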