Wrong results produced by Cufft

Hi everyone,

I work on a 2000x2000 real matrix from a csv file and I want to perform a fft on it, like on matlab, scilab “fft2(Matrix)”, but I’am not able to get the same results…

int Nx, Ny;

cufftDoubleReal* champ = mCSVReader->getRealData(Nx, Ny);

if (!champ && Nx == -1 && Ny == -1)
{
	printf("[ERROR] Invalid input matrix !\n");
        return EXIT_FAILURE;
}

cufftDoubleComplex* champFFT = (cufftDoubleComplex*) malloc(Nx * Ny * sizeof(cufftDoubleComplex));

if (!launchFFT_R2C(champ, champFFT, Nx, Ny))
{
	printf("[ERROR] Unable to perform fft() !\n");
	return EXIT_FAILURE;
}

printComplexMatrix(champFFT, Nx, Ny, 5, true);

I have tried with the functions below :

__host__ bool launchFFT_R2C(const cufftDoubleReal * dataIn, cufftDoubleComplex * dataOut, int col, int row)
{
	cufftResult_t result;
	cufftHandle plan;

	cufftDoubleReal *d_idata;
	cufftDoubleComplex *d_odata;

	cudaMalloc(&d_idata, row * col * sizeof(cufftDoubleReal));
	cudaMalloc(&d_odata, row * col * sizeof(cufftDoubleComplex));

	cudaMemcpy(d_idata, dataIn, row * col * sizeof(cufftDoubleReal), cudaMemcpyHostToDevice);

	result = cufftPlan2d(&plan, row, col, CUFFT_D2Z); ERROR_HANDLER_CUFFT
	result = cufftExecD2Z(plan, d_idata, d_odata); ERROR_HANDLER_CUFFT

	cudaMemcpy(dataOut, d_odata, sizeof(cufftDoubleComplex) * row * col, cudaMemcpyDeviceToHost);

	cudaFree(d_idata);
	cudaFree(d_odata);
	cufftDestroy(plan);

	return true;
}
__host__ bool launchFFT_C2C(const cufftDoubleComplex * dataIn, cufftDoubleComplex * dataOut, int NX, int NY, int direction)
{
	cudaError_t err;
	cufftDoubleComplex *gpu_initial_array;
	cufftDoubleComplex *gpu_transformed_array;

	err = cudaMalloc(&gpu_initial_array, NX * NY * sizeof(cufftDoubleComplex)); ERROR_HANDLER_CUDA
	err = cudaMalloc(&gpu_transformed_array, NX * NY * sizeof(cufftDoubleComplex)); ERROR_HANDLER_CUDA
	err = cudaMemcpy(gpu_initial_array, dataIn, NX * NY * sizeof(cufftDoubleComplex), cudaMemcpyHostToDevice); ERROR_HANDLER_CUDA

	err = cudaMemcpy2D(gpu_initial_array, sizeof(cufftDoubleComplex), dataIn, sizeof(double), sizeof(double), NX * NY, cudaMemcpyHostToDevice);

	cufftHandle plan;

	cufftResult_t result;
	result = cufftPlan2d(&plan, NX, NY, CUFFT_Z2Z); ERROR_HANDLER_CUFFT
	result = cufftExecZ2Z(plan, gpu_initial_array, gpu_transformed_array, direction); ERROR_HANDLER_CUFFT

	err = cudaMemcpy(dataOut, gpu_transformed_array, NX * NY * sizeof(cufftDoubleComplex), cudaMemcpyDeviceToHost); ERROR_HANDLER_CUDA

	cudaFree(gpu_initial_array);
	cudaFree(gpu_transformed_array);
	cufftDestroy(plan);

	return true;
}

And I use this one to print out my results :

void Application::printComplexMatrix(const cufftDoubleComplex* data, int cols, int rows, bool cufft_symmetry)
{
	int sym_cols = cols;

	if (cufft_symmetry) sym_cols = cols / 2 + 1;

	printf("Real Part: \n");
	for (int i = 0; i < rows; i++) 
	{
		for (int j = 0; j < cols; j++)
			if (j >= sym_cols)
				printf("%f ", data[i * sym_cols + (cols - j)].x);
			else
				printf("%f ", data[i * sym_cols + j].x);
		printf("\n");
	}

	printf("Imag Part: \n");
	for (int i = 0; i < rows; i++) 
	{
		for (int j = 0; j < cols; j++)
			if (j >= sym_cols)
				printf("%f ", -data[i * sym_cols + (cols - j)].y); // complex (hermitian) symmetry
			else
				printf("%f ", data[i * sym_cols + j].y);
		printf("\n");
	}
}

Can someone help me with that ?

My wrong results on the five first elements with launchFFT_R2C() function :

Real Part:
7832466.053380 336112.003120 -241287.237986 156417.233556 -64705.152967
-1297.465045 36688.280115 -39401.822953 22499.600954 114.471259
-15039.533396 17345.256324 -9181.924529 -2246.558350 9686.908870
-10097.060700 4820.556155 2492.306439 -6415.139433 5468.776117
-1167.533065 -3182.003654 5014.081975 -2978.421501 -416.719322
Imag Part:
0.000000 1583.099541 -2273.031640 2210.305203 -1219.242600
-30.616769 1037.027652 -1299.636249 848.103546 4.755247
-709.002702 899.346736 -519.653047 -137.874700 639.530842
-714.718918 363.790663 199.798430 -545.387079 490.502953
-110.532166 -315.999213 521.184674 -324.179986 -47.576409
Appuyez sur une touche pour continuer...

The good results on the five first elements (from matlab) :

7.8284e+06+0i | 3.3606e+05+1055.8i | -2.4121e+05-1515.6i | 1.5629e+05+1473i | -64547-811.17i

3.3606e+05+1055.8i | -2.9203e+05-1834.9i | 2.2444e+05+2115.3i | -1.3986e+05-1757.6i | 55427+870.72i

-2.4121e+05-1515.6i | 2.2444e+05+2115.3i | -1.7112e+05-2150.4i | 98701+1550.5i | -30268-570.61i

1.5629e+05+1473i | -1.3986e+05-1757.6i | 98701+1550.5i | -46815-882.54i | -2167.2-47.667i

-64547-811.17i | 55427+870.72i | -30268-570.61i | -2167.2-47.667i | 28277+710.83i

Anyone ?

This example may be of interest:

https://devtalk.nvidia.com/default/topic/826819/2d-cufft-wrong-result/

If you don’t provide a complete code that makes it easy for others to test your code, in my experience you’re less likely to get help.

Thank you for your help but I have already tried this. I will try again maybe I missed something.

In the real-to-complex case, when you are printing the matrix with your function, the FFT data has symmetry so you should be passing true here instead of false:

printComplexMatrix(champFFT, Nx, Ny, 5, false);
                                        ^^^^^

Other than that, when I build a small test case around your code, it seems to produce the same results as in the thread I linked:

$ cat t296.cu
#include <cufft.h>
#include <stdio.h>
#include <assert.h>

__host__ bool launchFFT_R2C(const cufftDoubleReal * dataIn, cufftDoubleComplex * dataOut, int col, int row)
{
        cufftResult_t result;
        cufftHandle plan;

        cufftDoubleReal *d_idata;
        cufftDoubleComplex *d_odata;

        cudaMalloc(&d_idata, row * col * sizeof(cufftDoubleReal));
        cudaMalloc(&d_odata, row * col * sizeof(cufftDoubleComplex));

        cudaMemcpy(d_idata, dataIn, row * col * sizeof(cufftDoubleReal), cudaMemcpyHostToDevice);

        result = cufftPlan2d(&plan, row, col, CUFFT_D2Z);
        assert(result == CUFFT_SUCCESS);
        result = cufftExecD2Z(plan, d_idata, d_odata);
        assert(result == CUFFT_SUCCESS);

        cudaMemcpy(dataOut, d_odata, sizeof(cufftDoubleComplex) * row * col, cudaMemcpyDeviceToHost);

        cudaFree(d_idata);
        cudaFree(d_odata);
        cufftDestroy(plan);

        return true;
}

void printComplexMatrix(const cufftDoubleComplex* data, int cols, int rows, int limit, bool cufft_symmetry)
{
        if (limit > 0)
        {
                if (cols != rows && rows == 1)
                {
                        cols = limit;
                }
                else
                {
                        cols = limit;
                        rows = limit;
                }
        }

        int sym_cols = cols;
        if (cufft_symmetry) sym_cols = cols / 2 + 1;

        printf("Real Part: \n");

        for (int i = 0; i < rows; i++)
        {
                for (int j = 0; j < cols; j++)
                        if (j >= sym_cols)
                                printf("%f ", data[i * sym_cols + (cols - j)].x);
                        else
                                printf("%f ", data[i * sym_cols + j].x);
                printf("\n");
        }

        printf("Imag Part: \n");
        for (int i = 0; i < rows; i++)
        {
                for (int j = 0; j < cols; j++)
                        if (j >= sym_cols)
                                printf("%f ", -data[i * sym_cols + (cols - j)].y); // complex (hermitian) symmetry
                        else
                                printf("%f ", data[i * sym_cols + j].y);
                printf("\n");
        }
}
void Print2DComplex(int rows, int cols, cufftDoubleComplex *data, bool cufft_symmetry = false){

        int sym_cols = cols;
        if (cufft_symmetry) sym_cols = cols/2 + 1;
        printf("Real Part: \n");
        for (int i = 0; i < rows; i++){
          for (int j = 0; j < cols; j++)
            if (j>=sym_cols)
              printf("%f ", data[i*sym_cols+(cols-j)].x);
            else
              printf("%f ", data[i*sym_cols+j].x);
          printf("\n");}
        printf("Imag Part: \n");
        for (int i = 0; i < rows; i++){
          for (int j = 0; j < cols; j++)
            if (j>=sym_cols)
              printf("%f ", -data[i*sym_cols+(cols-j)].y); // complex (hermitian) symmetry
            else
              printf("%f ", data[i*sym_cols+j].y);
          printf("\n");}
}



int main(){
        const int row = 4;
        const int col = 4;
        int Nx = col;
        int Ny = row;
        cufftDoubleReal A[row][col] =
        {{ 1, 2, 3, 4},
         { 5, 6, 7, 8},
         { 9,10,11,12},
         {13,14,15,16}};
        cufftDoubleReal *d_idata;
        cufftDoubleComplex *d_odata;
        cudaMalloc((void**)&d_idata, sizeof(cufftDoubleComplex)*row*col);
        cudaMalloc((void**)&d_odata, sizeof(cufftDoubleComplex)*row*col);
        cudaMemset(d_idata, 0, sizeof(cufftDoubleComplex)*row*col);
        cudaMemset(d_odata, 0, sizeof(cufftDoubleComplex)*row*col);

cufftDoubleReal* champ = &(A[0][0]);

cufftDoubleComplex* champFFT = (cufftDoubleComplex*) malloc(Nx * Ny * sizeof(cufftDoubleComplex));

if (!launchFFT_R2C(champ, champFFT, Nx, Ny))
{
        printf("[ERROR] Unable to perform fft() !\n");
        return EXIT_FAILURE;
}

printComplexMatrix(champFFT, Nx, Ny, 4, true);
Print2DComplex(row,col,champFFT, true);
}
$ nvcc -o t296 t296.cu -lcufft
$ cuda-memcheck ./t296
========= CUDA-MEMCHECK
Real Part:
136.000000 -8.000000 -8.000000 -8.000000
-32.000000 -0.000000 0.000000 -0.000000
-32.000000 -0.000000 0.000000 -0.000000
-32.000000 -0.000000 -0.000000 -0.000000
Imag Part:
-0.000000 8.000000 0.000000 -8.000000
32.000000 0.000000 0.000000 -0.000000
-0.000000 0.000000 0.000000 -0.000000
-32.000000 -0.000000 0.000000 0.000000
Real Part:
136.000000 -8.000000 -8.000000 -8.000000
-32.000000 -0.000000 0.000000 -0.000000
-32.000000 -0.000000 0.000000 -0.000000
-32.000000 -0.000000 -0.000000 -0.000000
Imag Part:
-0.000000 8.000000 0.000000 -8.000000
32.000000 0.000000 0.000000 -0.000000
-0.000000 0.000000 0.000000 -0.000000
-32.000000 -0.000000 0.000000 0.000000
========= ERROR SUMMARY: 0 errors
$

If you still need help, I suggest providing a complete test case. I have no idea how to interpret a 5x5 section of your output matrix for a 2000x2000 transform.

Thank you for your help I get the good results now !

You are right I just missed to put cufft_symmetry at true :/

I tried in the same way on a litle matrix 4x4 :

1.5 2.0 9.9 5.0
3.0 1.0 9.0 7.3
6.9 1.1 4.0 3.0
4.0 2.9 3.0 1.0

And I got the good results below with fft2() on Scilab :

64.6 | -10.5 + 9.3i | 18. | -10.5 - 9.3i
3.4 - 9.4i | -3.1 + 8.1i | -2.4 - 0.6i | -19.5 + 5.9i
2.2 | -0.5 + 0.5i | 4.4 | -0.5 - 0.5i
3.4 + 9.4i | -19.5 - 5.9i | -2.4 + 0.6i | -3.1 - 8.1i

But with my cufft function the results are different on the last columns :

Real Part:
64.600000 -10.500000 18.000000 -10.500000
3.400000 -3.100000 -2.400000 -3.100000
2.200000 -0.500000 4.400000 -0.500000
3.400000 -19.500000 -2.400000 -19.500000
Imag Part:
-0.000000 9.300000 0.000000 -9.300000
-9.400000 8.100000 -0.600000 -8.100000
-0.000000 0.500000 0.000000 -0.500000
9.400000 -5.900000 0.600000 5.900000

It seems that the results are just not ordered correctly but I don’t know why because the exemple with your matrix works very well:

{{ 1, 2, 3, 4},
{ 5, 6, 7, 8},
{ 9,10,11,12},
{13,14,15,16}};

I used launchFFT_R2C() and printComplexMatrix() with boolean to true like above.

My function launchFFT_C2C() doesn’t work too and the output is completly wrong.

I use CUDA 9.1

Anyone ?

The solution : https://devtalk.nvidia.com/default/topic/826819/2d-cufft-wrong-result/ it’s the same issue