Periodically discrepancies between Cufft and FFTW - output

Sorry for some formating issues here, that ’ s my first post to this forum.

I also published this issue on stackoverflow.
For my thesis, I have to optimize a special MPI-Navier Stokes-Solver program with CUDA. The original program uses FFTW for solving several PDEs. In detail, several upper triangle matrices are fourier tranformed in two dimensions, but handled as one dimensional arrays. For the moment, I’m struggling with parts of the original code: (N is always set to 64)

Original (within the .c file):

//Does the complex to real in place fft and normalizes
void fftC2R(double complex *arr) {
      fftw_execute_dft_c2r(plan_c2r, (fftw_complex*)arr, (double*)arr);
      //Currently ignored: Normalization
}

void doTimeStepETDRK2_nonlin_original() {
      //calc velocity
      ux[0] = 0;
      uy[0] = 0;

      for(int i=1; i<N*(N/2+1); i++) {
         ux[i] =  I*kvec[1][i]*qvec[i] / kvec[2][i];
         uy[i] = -I*kvec[0][i]*qvec[i] / kvec[2][i];
      }

      fftC2R(ux);
      fftC2R(uy);

      //do some stuff here...
      //...
      return;
}

//Where ux and uy are allocated as (double complex arrays):
    ux = (double complex*)fftw_malloc(N*(N/2+1) * sizeof(double complex));
    uy = (double complex*)fftw_malloc(N*(N/2+1) * sizeof(double complex));

//The fft-plan is created as:

    plan_c2r = fftw_plan_dft_c2r_2d(N, N,(fftw_complex*) qvec, (double*)qvec, FFTW_ESTIMATE);

//Here are the relevant parts of the CUDA code (within the cuda file):

    NN2_VecSetZero_and_init<<<block_size,grid_size>>>();
    cudaSafeCall(cudaDeviceSynchronize());
    cudaSafeCall(cudaGetLastError());

    int err = (int)cufftExecZ2D(cu_plan_c2r,(cufftDoubleComplex*)sym_ux,(cufftDoubleReal*)sym_ux);

    if (err != CUFFT_SUCCESS ) {
        exit(EXIT_FAILURE);
        return;
    }

    err = (int)cufftExecZ2D(cu_plan_c2r,(cufftDoubleComplex*)sym_uy,(cufftDoubleReal*)sym_uy);

    if (err != CUFFT_SUCCESS ) {
        exit(EXIT_FAILURE);
        return;
    }

    //do some stuff here...
    //...
    return;

//Where sim_ux and sim_uy are allocated as:
    cudaMalloc((void**)&sym_ux, N*(N/2+1)*sizeof(cufftDoubleComplex));
    cudaMalloc((void**)&sym_uy, N*(N/2+1)*sizeof(cufftDoubleComplex));

//The initialization of the relevant cufft parts looks like

if (cufftPlan2d(&cu_plan_c2r,N,N, CUFFT_Z2D) != CUFFT_SUCCESS){
        exit(EXIT_FAILURE);
        return -1;
    }

    if (cufftPlan2d(&cu_plan_r2c,N,N, CUFFT_D2Z)  != CUFFT_SUCCESS){
        exit(EXIT_FAILURE);
        return -1;
    }

    if ( cufftSetCompatibilityMode ( cu_plan_c2r , CUFFT_COMPATIBILITY_FFTW_ALL) != CUFFT_SUCCESS ) {
        exit(EXIT_FAILURE);
        return -1;
    }

    if ( cufftSetCompatibilityMode ( cu_plan_r2c , CUFFT_COMPATIBILITY_FFTW_ALL) != CUFFT_SUCCESS ) {
        exit(EXIT_FAILURE);
        return -1;
    }

So I use full FFTW compatibility and I call every function with the FFTW calling patterns. When I run both versions, I receive almost equal results for ux and uy (sim_ux and sim_uy). But at periodically positions of the arrays, Cufft seems to ignore these elements, where FFTW sets the real part of these elements to zero and calculates the complex parts (the arrays are too large to show them here). The stepcount, for which this occurs, is N / 2+1. So I believe, I didn’t

So I use full FFTW compatibility and I call every function with the FFTW calling patterns. When I run both versions, I receive almost equal results for ux and uy (sim_ux and sim_uy). But at periodically positions of the arrays, Cufft seems to ignore these elements, where FFTW sets the real part of these elements to zero and calculates the complex parts (the arrays are too large to show them here). The stepcount, for which this occurs, is N / 2+1. So I believe, I didn’t completely understand fft padding theory between Cufft and FFTW. I can exclude any former discrepancies between these arrays, until Cufft executions are called. So any other arrays of the code above are not relevant here. My question is: am I too optimistic in using almost 100 percent of the FFTW calling style? Do I have to handle my arrays before the ffts? The Cufft documentation says, I’d have to resize the data input and output arrays. But how can I do this, when I’m running inplace transformations? I really wouldn’t like to go too far distant from the original code and I don’t want to use any more copy instructions for each fft call, because memory is limited and arrays should remain and processed on gpu as long as possible.

I’m thankful for every hint, critical statement or idea!

My configuration:

  • Compiler: gcc 4.6 (C99 Standard)
  • MPI-Package: mvapich2_1.5.1p1 (shouldn't play a role because of reduced single processing debug mode)
  • CUDA-Version: 4.2
  • GPU: CUDA arch compute_20 (NVIDIA GeForce GTX 570)
  • FFTW 3.3