cufftExecZ2Z result not match MATLAB result

So the MATLAB code here

t=[0:1/1000:1-1/1000];
j=sqrt(-1);
s=[10*exp(j*2*pi*100*t)];
A=exp(-j*2*pi*7.5*[0:63].'*sin(77*pi/180)/15);
data=A*s; 
data=data';
d0=data(:,1)';
d1=fft(d0);

and here is the first 20 of d1 on my pc

1.02318153949454e-11 - 3.89189389312207e-12i
3.96304793962431e-12 - 9.59627675712573e-12i
-1.22490611118041e-12 - 1.87442732754772e-12i
-2.62883065705508e-12 - 4.86408274878475e-12i
6.31721550091214e-12 + 3.25249180137728e-13i
-3.37521420326130e-12 - 7.87751355432440e-12i
4.41500716384229e-12 - 3.59597047697461e-12i
-7.07180311825956e-12 - 4.74217362079010e-13i
-5.62876972134328e-12 - 6.39511318817060e-12i
-5.38097000286798e-13 - 1.98707845052568e-12i
6.10931383113531e-13 + 2.44965382298964e-12i
-6.22210171250057e-12 + 5.28767466113619e-13i
-3.84916261802694e-12 + 3.11419120171075e-12i
-2.99812100496280e-12 + 1.23654974114445e-12i
5.19959329736483e-12 + 4.62825855394361e-12i
1.24055492666447e-12 + 6.60383790841987e-13i
3.42124542782838e-12 + 6.59616998606136e-13i
1.87649818762980e-12 - 2.23924607892215e-12i
2.60970084311677e-12 - 2.68843698366259e-13i
-5.41275677319708e-14 - 4.06094516348544e-12i

and here is my cuda code

#include <stdio.h>
#include <math.h>
#include <cufft.h>
#include <cuda_runtime.h>

#define PI 3.1415926

__managed__ cuDoubleComplex j;

__global__ void sGen(cuDoubleComplex* s, cuDoubleComplex* A, int M, int N);
__global__ void matMul(cuDoubleComplex* s, cuDoubleComplex* A, cuDoubleComplex* data, int N);

int main()
{
    j.x=0.0;
    j.y=1.0;

    cuDoubleComplex*  s;
    cudaMallocManaged((void **)&s, 1000 * sizeof(cuDoubleComplex));
    cuDoubleComplex*  A;
    cudaMallocManaged((void **)&A, 64 * sizeof(cuDoubleComplex));
    cuDoubleComplex*  data;
    cudaMallocManaged((void **)&data, 64000 * sizeof(cuDoubleComplex));

    sGen<<<1, 1000>>>(s, A, 64, 1000);
    cudaDeviceSynchronize();
    matMul<<<64, 1000>>>(s, A, data, 1000);
    cudaDeviceSynchronize();

    for(int n = 0 ; n < 20; n++)
    {
        printf("data[%d]: %2.7e\n", n, data[n].x);
        printf("data[%d]: %2.7e\n", n, data[n].y);
    }

    cufftHandle plan;
    int rank = 1;
    int n[1] ={1000};
    int istride = 1;
    int idist = 1000;
    int ostride =1;
    int odist = 1000;
    int inembed[2] = {1000, 64};
    int onembed[2] = {1000, 64};

    cufftPlanMany(&plan, rank, n, inembed, istride, idist, onembed, ostride, odist, CUFFT_Z2Z, 64);
    cufftExecZ2Z(plan, data, data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    for(int n = 0 ; n < 20; n++)
    {
        printf("data[%d]: %2.7e\n", n, data[n].x);
        printf("data[%d]: %2.7e\n", n, data[n].y);
    }

    return 0;
}

__global__ void sGen(cuDoubleComplex* s, cuDoubleComplex* A, int M, int N)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;

    if(tid < N)
    {
        double t = 1.0 / 1000 * tid;
        double a1 = j.x * 2 * PI * 100 * t;
        double b1 = j.y * 2 * PI * 100 * t;

        double sin1, cos1;
        double e1 = exp(a1);
        sincos(b1, &sin1, &cos1);
        s[tid].x = 10 * cos1 * e1; 
        s[tid].y = 10 * sin1 * e1; 
    }
    if(tid < M)
    {
        double a1 = -j.x * 2 * PI * 7.5 * tid * sin(77.0 * PI / 180) / 15.0;
        double b1 = -j.y * 2 * PI * 7.5 * tid * sin(77.0 * PI / 180) / 15.0;

        double sin1, cos1;
        double e1 = exp(a1);
        sincos(b1, &sin1, &cos1);
        A[tid].x = cos1 * e1; 
        A[tid].y = sin1 * e1; 
    }
}

__global__ void matMul(cuDoubleComplex* s, cuDoubleComplex* A, cuDoubleComplex* data, int N)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int row = tid / N;
    int col = tid % N;
    data[tid].x = A[row].x * s[col].x - A[row].y * s[col].y;
    data[tid].y = A[row].x * s[col].y + A[row].y * s[col].x; 
}

and the part of the output of data is following,

data[0]: -1.6493214e-04
data[0]: 5.3590679e-05
data[1]: -1.6671242e-04
data[1]: 5.3590680e-05
data[2]: -1.6852784e-04
data[2]: 5.3590693e-05
data[3]: -1.7037949e-04
data[3]: 5.3590703e-05
data[4]: -1.7226850e-04
data[4]: 5.3590716e-05
data[5]: -1.7419606e-04
data[5]: 5.3590725e-05

or you can compile this code with

nvcc -o obj obj.cu -lcufft

I have verified the input data of fft already. It’s as same as MATLAB
It’s so strange that the result of cuda has so huge deviation against MATLAB code.
Is my cuda code wrong or the MATLAB bug?
Somebody help me please.

When I run your matlab code that is not what I get for d1. But it may not matter.

When I compare the results, I observe that in both results the value at index 100 is

1.0e4 + j…

all the rest of the results have exponents of -11 or -12. That means the difference in magnitude is on the order of 1e16. A double quantity can hold about 10-12 digits (significant figures). You have very large data being combined with much smaller data.

So mostly what you are looking at is some kind of noise floor. And very minor differences in the input data (there are some in the 5th decimal place and beyond) will affect the “noise” at the “noise floor”. But the only result that is in a calculable range matches between the two sets of results.

So you mean you tried my cuda code and its results show exponents of -11 or -12? But why mine are only -4 or -5? Or you modified some of this code.

Because in my next calculation used with fft result, like absolute value and sin cos stuff, the finalanswer is the same as MATLAB. So I wondered maybe it’s the expression bug of “printf” or something.

Sorry for my terrible english, if i got you right, in the last sentence you think that my fft code is right and the two sets of results are the same in a calculable range right?

I ran both your matlab code, as well as your CUDA/cufft code.

For both cases, I agree that the input data to the FFT is numerically close to each other. I saw differences in a few cases in the 4th or 5th decimal place.

Regarding the output data (which is what d1 is), I agree that the results are different, sometimes by orders of magnitude.

I also observed that there was no difference in the result (order of magnitude and to ~5 decimal places, at least) in the “spike” in your output data, which is at index 100 (100 for 0 based indexing, matlab uses 1-based indexing, so index 101 there). This single result point was, relative to the rest of the matlab output data which was at magnitude 1e-12 (approximately), about 16 orders of magnitude higher.

In the cufft result, this particular output point was exactly the same (magnitude, and to 5 decimal places) whereas the rest of the data was at about 1e-8.

So yes, the “noise floor” is higher in the cufft case. But the cufft input data differs in some cases from the matlab input data in the 4th or 5th decimal place. This could result in a difference that is 4 or 5 orders of magnitude difference in the “noise floor”.

There is certainly a lot of handwaving here. But I’m fairly confident that the problem here is your generation of the input data and interpretation of the output data, and not that there is some actual difference between matlab fft and cufft fft.

Yes, I acknowledge the results are different, I believe for the reasons I already stated.

I think I get your point now.

It’s the 4th or 5th decimal place differences of input data between matlab and cuda code that causes the strange output magnitude of cuda fft result. And index 100, the “spike” point, can show there is no differences between 2 kind of fft.

And I think your detailed explanation and my final correct data with the use of cufft results make me think there is no actual difference between matlab fft and cufft. Thanks a lot for the work you did.

Also, I wondered how you observed the output data of 1e-8, which is quoted in your last word of 5th paragraph.
Because from my test, also posted in the original thread, output data only show from 1e-4 or 1e-5.

Yes, I was unclear. My CUFFT results are mostly around magnitude of 1e-4 or 1e-5. What I was trying to communicate is that these results are approximately 8 orders of magnitude (1e-8) below the "spike"value at index 100 (MATLAB index 101).

If the CUFFT input data differs from the MATLAB input data in the 4th or 5th decimal places, and we are multiplying values together (as you would in a FFT butterfly) then in a hand-waving way I am suggesting that this could cause the CUFFT results to differ from the MATLAB results at about 8-10 orders of magnitude below the peak, which is what I am seeing.

For example on the MATLAB input data, at index 964, the real part is printed out as -3.0902. The CUFFT input data at index 963 for the real part is -3.0901. When you are looking at the results that are 8 to 16 orders of magnitude below the peak value, I think that could make a difference.

At least for the MATLAB input data, your real and imaginary coefficients seem to be a repeating sequence of just a handful of numbers. You might simply create input data on both sides that uses these numbers, rather than the calculation you have, to make the input data identical.

Even if you iron out the input differences, the order of floating point calculations between MATLAB and CUFFT will almost certainly not be the same, so differences are still possible. See here.

1 Like

Yeah I think you are right. I tried different angles(its 77 in cuda code) and every time the final result show right (which is not shown in cuda code).

This is just a simulation signal so it seems like a repeating sequence.

Thanks a lot for your patient guidance!