3D FFT C2C result differs from Matlab fftn

When I use a 2D FFT and compare the result with Matlab, the result is correct. But when I use a 3D complex-to-complex transform, I get the wrong result.

The input is a 3x4x2 data array, converted to complex data (imag = 0):

x(:,:,1) =

     0     2     6    12
    20    30    42    56
    72    90   110   132

x(:,:,2) =

   156   182   210   240
   272   306   342   380
   420   462   506   552

I first use the cuFFT 3D C2C transform:

#include <stdio.h>
#include <stdlib.h>
#include "cuda_runtime.h"
#include "cufft.h"
#include "device_launch_parameters.h"

#define NDIM 3
#define NX 3
#define NY 4
#define NZ 2

int main()
{
	int N[3];
	N[0] = NX; N[1] = NY; N[2] = NZ;
	int LENGTH = N[0] * N[1] * N[2];

	// Host buffers for the input and the transformed output
	cufftComplex *inputcccc = (cufftComplex*) malloc(LENGTH * sizeof(cufftComplex));
	cufftComplex *output_data = (cufftComplex*) malloc(LENGTH * sizeof(cufftComplex));

	// Fill the input with i*i + i (real part only)
	int i;
	for (i = 0; i < LENGTH; i++) {
		inputcccc[i].x = i * i + i;
		inputcccc[i].y = 0;
	}

	// Copy the input to the device
	cufftComplex *d_inputCom;
	cudaMalloc((void**) &d_inputCom, LENGTH * sizeof(cufftComplex));
	cudaMemcpy(d_inputCom, inputcccc, LENGTH * sizeof(cufftComplex), cudaMemcpyHostToDevice);

	// Plan and run an in-place forward 3D C2C transform
	cufftHandle plan1;
	cufftPlan3d(&plan1, N[0], N[1], N[2], CUFFT_C2C);
	cufftExecC2C(plan1, d_inputCom, d_inputCom, CUFFT_FORWARD);

	// Copy the result back and print it as "real imag"
	cudaMemcpy(output_data, d_inputCom, LENGTH * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
	for (i = 0; i < LENGTH; i++) {
		printf("%f %f \n", output_data[i].x, output_data[i].y);
	}

	// Release device and host resources
	cufftDestroy(plan1);
	cudaFree(d_inputCom);
	free(inputcccc);
	free(output_data);
	return 0;
}

The result is:

4600.000000 0.000000 
-288.000000 0.000000 
-528.000000 624.000000 
24.000000 -24.000000 
-576.000000 0.000000 
24.000000 0.000000 
-528.000000 -624.000000 
24.000000 24.000000 
-2048.000000 1773.619995 
96.000000 -55.425659 
81.148758 -302.851257 
0.000015 0.000000 
192.000000 -110.851257 
0.000000 0.000000 
302.851257 81.148743 
-0.000015 0.000000 
-2048.000000 -1773.619995 
96.000000 55.425659 
302.851257 -81.148743 
-0.000015 0.000000 
192.000000 110.851257 
0.000000 0.000000 
81.148758 302.851257 
0.000015 0.000000

But when I use this code in Matlab:

zdim=2;
x=zeros(3,4,zdim);

for i=1:zdim
    for k=1:3
        for j=1:4
            index=(k-1)*4+(j-1) + (i-1)*3*4;
            x(k,j,i)= index*index+index;
        end
    end
end
fftn(x)

The result, shown as X-Y slices, is:

ans(:,:,1) =

   1.0e+03 *

   4.6000 + 0.0000i  -0.2760 + 0.3000i  -0.2880 + 0.0000i  -0.2760 - 0.3000i
  -1.0880 + 0.7760i   0.0203 - 0.0757i   0.0480 - 0.0277i   0.0757 + 0.0203i
  -1.0880 - 0.7760i   0.0757 - 0.0203i   0.0480 + 0.0277i   0.0203 + 0.0757i

ans(:,:,2) =

   1.0e+03 *

  -3.4560 + 0.0000i   0.1440 - 0.1440i   0.1440 + 0.0000i   0.1440 + 0.1440i
   0.5760 - 0.3326i  -0.0000 + 0.0000i   0.0000 + 0.0000i   0.0000 + 0.0000i
   0.5760 + 0.3326i   0.0000 - 0.0000i   0.0000 + 0.0000i  -0.0000 - 0.0000i

Leaving numerical accuracy aside, the second and fourth values differ (real parts shown):

in the cuFFT result:
-288 and 24
in Matlab:
-276 and -276

This may be of interest:

https://devtalk.nvidia.com/default/topic/1031771/gpu-accelerated-libraries/questions-about-cufft-for-3d-matrix-arrayfire-/

Thank you very much. I needed to do a “reversal” of NX and NZ. That solved my problem!