I ran a 400-point FFT on my input data using two methods:

1. C2C forward transform with length nx*ny

2. R2C transform with length nx*(nyh+1)
Observations when profiling the code:
Method 1 calls SP_c2c_mradix_sp_kernel twice, for a total of 24 usec.
Method 2 calls SP_c2c_mradix_sp_kernel (12.32 usec) and SP_r2c_mradix_sp_kernel (12.32 usec), about 24.6 usec in total.
So in the end there is no improvement from using the real-to-complex transform over the complex-to-complex transform. Theoretically there should be one, since Method 2 processes only about half the size of the second dimension. Am I missing something? This is also mentioned on page 21 of the CUFFT_Library_3.1 manual.
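To make the "half the second dimension" expectation concrete, here is a minimal sketch in plain C that counts the complex output elements of each method. The helper names are hypothetical, and the 20 x 20 size used in the note below is only an assumed way of getting 400 points; the actual nx and ny are not stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: complex output elements of a 2D C2C transform (nx * ny). */
static size_t c2c_output_elems(size_t nx, size_t ny)
{
    assert(nx > 0 && ny > 0);
    return nx * ny;
}

/* Hypothetical helper: complex output elements of a 2D R2C transform.
 * Only ny/2 + 1 non-redundant values are kept along the last dimension. */
static size_t r2c_output_elems(size_t nx, size_t ny)
{
    assert(nx > 0 && ny > 0);
    return nx * (ny / 2 + 1);
}
```

For an assumed 20 x 20 input, c2c_output_elems(20, 20) gives 400 and r2c_output_elems(20, 20) gives 220, so the saving is a little under half. At a size this small, fixed per-kernel overhead may well dominate the measured time, which could hide that saving in the profile.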
Secondly, my results from the R2C transform do not match between CUFFT and FFTW. I don't know what the issue is here. The code for both is below.

CUFFT version:
double* ffcorr1;
cufftComplex *f1_d;
cudaMalloc((void**) &ffcorr1, sizeof(double) * pix3);
cudaMalloc((void**) &f1_d, sizeof(cufftComplex) * pix1 * (pix2/2 + 1));
// create plan for CUDA FFT
cufftHandle plan_forward1;
CUFFT_SAFE_CALL(cufftPlan2d(&plan_forward1, pix1, pix2, CUFFT_R2C));
CUFFT_SAFE_CALL(cufftExecR2C(plan_forward1, (cufftReal*) ffcorr1, f1_d)); //cast double* ffcorr1 as cufftReal*
//Destroy CUFFT context
CUFFT_SAFE_CALL(cufftDestroy(plan_forward1));

FFTW version:
double* ffcorr1;
fftw_complex *f1;
ffcorr1 = (double*) malloc(sizeof(double) * pix3);
f1 = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * pix1 * (pix2/2 + 1) * n);
fftw_plan plan_forward1 = fftw_plan_dft_r2c_2d(pix1, pix2, ffcorr1, f1, FFTW_ESTIMATE);
fftw_execute ( plan_forward1 );
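
One thing worth checking in the CUFFT version, offered as a possible cause rather than a confirmed diagnosis: cufftReal is a single-precision float, while ffcorr1 holds doubles. Casting the pointer in the cufftExecR2C call does not convert the values; it makes CUFFT read the raw bytes of each double as two floats. FFTW's fftw_plan_dft_r2c_2d, by contrast, works on doubles. The plain C sketch below (helper names are hypothetical) shows the difference between converting a buffer and reinterpreting it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert n doubles to floats value by value. */
static void convert_double_to_float(const double *src, float *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = (float)src[i];
    }
}

/* Hypothetical helper: the float that a reader sees at the start of a
 * double buffer, i.e. the effect of a (float*) pointer cast.
 * memcpy is used so the reinterpretation is well defined in C. */
static float first_float_of_double_buffer(const double *src)
{
    float f;
    assert(src != NULL);
    memcpy(&f, src, sizeof f);
    return f;
}
```

With an input of {1.0, 2.0}, the converted buffer holds 1.0f and 2.0f, whereas the reinterpreted first element is not 1.0f. If this is the cause, two options are to convert the data to float before copying it to a device buffer of sizeof(cufftReal) * pix1 * pix2 bytes, or to stay in double precision with cufftDoubleReal, cufftDoubleComplex, CUFFT_D2Z, and cufftExecD2Z, which I believe are available in the 3.x releases. With the single-precision route, agreement with FFTW should be expected only to roughly six or seven significant digits.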