Cufft double precision

Hi,
I have a problem using Cufft with double precision numbers. When doing a forward transform followed by inverse transform, the output does not equal the original data…
However, when I use the following code with smaller numbers, it seems to be correct.
Any hint to where the error is would be great. Thank you in advance.

...
....


int N = number_of_points_in_directions_XYZ ;

cufftResult e;
cufftDoubleComplex *rho, *d_rho, *rho2;
cufftDoubleComplex *d_rho_complex_out;
cufftHandle fft_rho_forward_z2z;
cufftHandle fft_rho_inverse_Z2Z;

//malloc rho on host
rho = (cufftDoubleComplex*) malloc(sizeof(cufftDoubleComplex) *N*N*N );
rho2 = (cufftDoubleComplex*) malloc(sizeof(cufftDoubleComplex) *N*N*N );


//malloc rho device
if (cudaSuccess != cudaMalloc((void**)&d_rho,sizeof(cufftDoubleComplex)*N*N*N))
    fprintf(stderr, "allocation of d_rho on device failed\n");  
CudaTest("allocation of d_rho on device failed");

if (cudaSuccess != cudaMalloc((void**)&d_rho_complex_out,sizeof(cufftDoubleComplex)*N*N*N))
  fprintf(stderr, "malloc of d_rho_complex_out on device failed\n");  
CudaTest("malloc d_rho_complex_out on device failed"); 



//create forward plan
e = cufftPlan3d(&fft_rho_forward_z2z,N,N,N,CUFFT_Z2Z);
fprintf(stderr, "cufftPlan3d creation of fft_rho_forward_z2z failed with %s\n",_cudaGetErrorEnum(e)); 


//create inverse plan
e = cufftPlan3d(&fft_rho_inverse_Z2Z,N,N,N,CUFFT_Z2Z);
fprintf(stderr, "cufftPlan3d creation of fft_rho_inverse_Z2Z failed with %s\n",_cudaGetErrorEnum(e));




//fill rho on host
//rho and rho2 will hold the same data. 
for(int i = 0; i < N*N*N ; i++)
    {
      
       //mass is an array of doubles 
       rho[i].x = mass[i];//using e.g. 1.0 here will produce correct numbers (0.00) in the output.
       rho[i].y = 0.0;
       rho2[i]= rho[i];
   
    }


//copy rho to gpu;
if (cudaSuccess != cudaMemcpy(d_rho,rho, sizeof(cufftDoubleComplex)*N*N*N, cudaMemcpyHostToDevice))
    fprintf(stderr, "copying of rho to d_rho on device failed\n");  
CudaTest("copying rho to device failed"); 



//execute forward fft.
//result will be in rho_complex_out
e = cufftExecZ2Z(fft_rho_forward_z2z,d_rho,d_rho_complex_out,CUFFT_FORWARD);
fprintf(stderr, "cufftExecZ2Z of fft_rho_forward_z2z failed with %s\n",_cudaGetErrorEnum(e)); 

//destroy the forward plan.
cufftDestroy(fft_rho_forward_z2z); 


//execute inverse fft 
//result will be in d_rho 
e = cufftExecZ2Z(fft_rho_inverse_Z2Z,d_rho_complex_out,d_rho,CUFFT_INVERSE);
fprintf(stderr, "cufftExecZ2Z of fft_rho_inverse_Z2Z failed with %s\n",_cudaGetErrorEnum(e)); 

//destroy the inverse plan
cufftDestroy(fft_rho_inverse_Z2Z); 



//copy reult back to host
if (cudaSuccess != cudaMemcpy(rho,d_rho, sizeof(cufftDoubleComplex) * N*N*N, cudaMemcpyDeviceToHost))
     fprintf(stderr, "copying of d_rho on device to rho on host failed\n");  
CudaTest("copying of d_rho on device to rho on host failed"); 

//cleanup memory on decive
cudaFree(d_rho);
cudaFree(d_rho_complex_out);


//print difference betwee the first 32 numbers in source and output
for (int i = 0; i < 32; i++)
{ 
double ff2 = rho[i].x/(double)(N*N*N);
printf("%e\n",(ff2)-rho2[i].x);
}
...
....

An example of the output…

0.000000e+00
7.430939e+16
5.404320e+16
1.125900e+17
-4.728780e+16
9.007199e+15
-2.251800e+16
1.351080e+16
8.106479e+16
1.576260e+16
3.602880e+16

After testing the same procedure with floats (using CUFFTC2C…), the same output appears.
What am i doing wrong?

WIth CUFFT:
IFFT(FFT(A))=len(A)*A
so, you need to normalize.