Periodically discrepancies between Cufft and FFTW - output

Petrucciation · January 5, 2013, 5:14pm

Sorry for some formating issues here, that ’ s my first post to this forum.

I also published this issue on stackoverflow.
For my thesis, I have to optimize a special MPI-Navier Stokes-Solver program with CUDA. The original program uses FFTW for solving several PDEs. In detail, several upper triangle matrices are fourier tranformed in two dimensions, but handled as one dimensional arrays. For the moment, I’m struggling with parts of the original code: (N is always set to 64)

Original (within the .c file):

//Does the complex to real in place fft and normalizes
void fftC2R(double complex *arr) {
      fftw_execute_dft_c2r(plan_c2r, (fftw_complex*)arr, (double*)arr);
      //Currently ignored: Normalization
}

void doTimeStepETDRK2_nonlin_original() {
      //calc velocity
      ux[0] = 0;
      uy[0] = 0;

      for(int i=1; i<N*(N/2+1); i++) {
         ux[i] =  I*kvec[1][i]*qvec[i] / kvec[2][i];
         uy[i] = -I*kvec[0][i]*qvec[i] / kvec[2][i];
      }

      fftC2R(ux);
      fftC2R(uy);

      //do some stuff here...
      //...
      return;
}

//Where ux and uy are allocated as (double complex arrays):
    ux = (double complex*)fftw_malloc(N*(N/2+1) * sizeof(double complex));
    uy = (double complex*)fftw_malloc(N*(N/2+1) * sizeof(double complex));

//The fft-plan is created as:

    plan_c2r = fftw_plan_dft_c2r_2d(N, N,(fftw_complex*) qvec, (double*)qvec, FFTW_ESTIMATE);

//Here are the relevant parts of the CUDA code (within the cuda file):

    NN2_VecSetZero_and_init<<<block_size,grid_size>>>();
    cudaSafeCall(cudaDeviceSynchronize());
    cudaSafeCall(cudaGetLastError());

    int err = (int)cufftExecZ2D(cu_plan_c2r,(cufftDoubleComplex*)sym_ux,(cufftDoubleReal*)sym_ux);

    if (err != CUFFT_SUCCESS ) {
        exit(EXIT_FAILURE);
        return;
    }

    err = (int)cufftExecZ2D(cu_plan_c2r,(cufftDoubleComplex*)sym_uy,(cufftDoubleReal*)sym_uy);

    if (err != CUFFT_SUCCESS ) {
        exit(EXIT_FAILURE);
        return;
    }

    //do some stuff here...
    //...
    return;

//Where sim_ux and sim_uy are allocated as:
    cudaMalloc((void**)&sym_ux, N*(N/2+1)*sizeof(cufftDoubleComplex));
    cudaMalloc((void**)&sym_uy, N*(N/2+1)*sizeof(cufftDoubleComplex));

//The initialization of the relevant cufft parts looks like

if (cufftPlan2d(&cu_plan_c2r,N,N, CUFFT_Z2D) != CUFFT_SUCCESS){
        exit(EXIT_FAILURE);
        return -1;
    }

    if (cufftPlan2d(&cu_plan_r2c,N,N, CUFFT_D2Z)  != CUFFT_SUCCESS){
        exit(EXIT_FAILURE);
        return -1;
    }

    if ( cufftSetCompatibilityMode ( cu_plan_c2r , CUFFT_COMPATIBILITY_FFTW_ALL) != CUFFT_SUCCESS ) {
        exit(EXIT_FAILURE);
        return -1;
    }

    if ( cufftSetCompatibilityMode ( cu_plan_r2c , CUFFT_COMPATIBILITY_FFTW_ALL) != CUFFT_SUCCESS ) {
        exit(EXIT_FAILURE);
        return -1;
    }

So I use full FFTW compatibility and I call every function with the FFTW calling patterns. When I run both versions, I receive almost equal results for ux and uy (sim_ux and sim_uy). But at periodically positions of the arrays, Cufft seems to ignore these elements, where FFTW sets the real part of these elements to zero and calculates the complex parts (the arrays are too large to show them here). The stepcount, for which this occurs, is N / 2+1. So I believe, I didn’t

Petrucciation · January 5, 2013, 5:58pm

So I use full FFTW compatibility and I call every function with the FFTW calling patterns. When I run both versions, I receive almost equal results for ux and uy (sim_ux and sim_uy). But at periodically positions of the arrays, Cufft seems to ignore these elements, where FFTW sets the real part of these elements to zero and calculates the complex parts (the arrays are too large to show them here). The stepcount, for which this occurs, is N / 2+1. So I believe, I didn’t completely understand fft padding theory between Cufft and FFTW. I can exclude any former discrepancies between these arrays, until Cufft executions are called. So any other arrays of the code above are not relevant here. My question is: am I too optimistic in using almost 100 percent of the FFTW calling style? Do I have to handle my arrays before the ffts? The Cufft documentation says, I’d have to resize the data input and output arrays. But how can I do this, when I’m running inplace transformations? I really wouldn’t like to go too far distant from the original code and I don’t want to use any more copy instructions for each fft call, because memory is limited and arrays should remain and processed on gpu as long as possible.

I’m thankful for every hint, critical statement or idea!

My configuration:

Compiler: gcc 4.6 (C99 Standard)
MPI-Package: mvapich2_1.5.1p1 (shouldn't play a role because of reduced single processing debug mode)
CUDA-Version: 4.2
GPU: CUDA arch compute_20 (NVIDIA GeForce GTX 570)
FFTW 3.3

Topic		Replies	Views
cuFFT and FFTW ciscrepancies GPU-Accelerated Libraries	1	1340	February 5, 2013
output of CUFFT not centered like FFTW CUDA Programming and Performance	7	7585	September 28, 2011
result of cufft is different with fftw CUDA Programming and Performance	0	515	June 1, 2014
Wrong results in CUFFT! CUDA Programming and Performance	4	5480	March 22, 2011
CUFFT Library GPU-Accelerated Libraries	1	557	October 4, 2017
CUFFT run wrong CUDA Programming and Performance	16	2837	May 23, 2013
Wrong results produced by Cufft GPU-Accelerated Libraries	8	1115	October 4, 2018
cuFFT and fftw CUDA Programming and Performance	10	4210	August 25, 2010
Can somebody explain this? (cufft strange result!) CUDA Programming and Performance	3	2062	September 18, 2009
Cufft problem input data overwritten GPU-Accelerated Libraries cufft	2	8533	July 28, 2021

Periodically discrepancies between Cufft and FFTW - output

Related topics