Bad output from 3D cufft C2R transform?

I’m having a problem with 3D inverse dft’s using the C2R transform. If I use the full C2C transform and take only the real values, I get the expected results. However, if I use packed (n1 x n2 x (n3 / 2 + 1)) complex input and the C2R transform, I get incorrect results and there are some suggestions that out of bounds memory is being accessed. (No segmentation violations, but other somewhat random behavior). Specifically, I’m calling an inverse transform with input which is the complex transform of a 4x4x4 volume with each row of each plane containing [0 .707 1 .707]. I’ve verified that the packed, interleaved input appears correct, at least according to my understanding of the cufft and fftw documentation. With the C2C version, the test1 program below produces exactly the expected result:

Output:
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7

0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7

0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7

0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7

but with the C2R version I get:

Output:
-0.2 0.5 0.8 0.5
0.0 0.7 1.0 0.7
0.2 1.0 1.2 1.0
0.0 0.7 1.0 0.7

-0.2 0.5 0.8 0.5
0.0 0.7 1.0 0.7
0.2 1.0 1.2 1.0
0.0 0.7 1.0 0.7

-0.2 0.5 0.8 0.5
0.0 0.7 1.0 0.7
0.2 1.0 1.2 1.0
0.0 0.7 1.0 0.7

-0.2 0.5 0.8 0.5
0.0 0.7 1.0 0.7
0.2 1.0 1.2 1.0
0.0 0.7 1.0 0.7

Notice that the even numbered rows are correct, but the odd numbered ones are not. The same results are obtained using either in-place or out-of-place transforms. (Code supplied below is for the in-place case).

This is using cuda 4.1 on RHEL6-64 6.2(2.6.32-220.4.2.el6.x86_64) with gcc 4.3.4 or 4.4.5 (both behave identically).
The system is an HP Pavillion with 12 GB of ram, 3.2 GHz 6-core I7 (970) processor, and a GT420. (Yes, I know that’s an underpowered card. I just haven’t gotten around to installing a better one yet.) The driver version is 295.20, although for various reasons our IT folks insist on installing via dkms rather running the Nvidia installer. So far, I haven’t seen any issues caused by this latter fact. I’m using a 3D grid, so I always compile with arch=compute_20 code=sm_20.

I’ve attached files to reproduce both of these cases. File test1.cpp is the common driver, while gpufft3C2R.cu and gpufft3C2C.cu are the bad and good cuda sources, respectively. I’d appreciate it greatly if anyone can point out whether I’ve made some error in the way I’m calling the C2R transform, or whether this suggests a bug in either the cufft library or the GT420 driver.

Thanks in advance!
gpufft3C2C.cu (4.26 KB)
gpufft3C2R.cu (4.56 KB)
test1.cpp (1.17 KB)

If there is memory problem, cuda-memecheck will find it. If you do an inplace transform you need to get extra care about how you define the real array since it must have the same size, but the padding is not continuous.

Thank you for your reply and suggestion.

No problems are reported by cuda-memcheck. I think I’ve handled the padding for the in-place case correctly, but it’s always possible that I’ve misinterpreted the documentation. In any event, both the in-place and out-of-place transforms return identical but incorrect results.

Regards,

-jh-

When I started to use cufft I did simple tests.

  1. make forward and back transforms

  2. put a periodic function like sin or cos and make the forward transform. in this case the result is 0 expect for a few points. If the input is 2cos((2pi/8) *i) the forward transform should have 1 point equal to 1 at 8.

  3. take a matrix which is 0 every one expect 1 point where it is equal to 1. Now take the forward transform and multiply with 2*(1-cos(kx), where kx is i2pi/Nx (Nx is the size in the x direction), then take the back transform. This is the equivalent of a discrete derivative. If the initial matrix is on some line

0 0 1 0 0 after the procedure you should get 0 1 -2 1 0

These are simple tests to figure out if the data is retrieved correctly in k space.

Thanks once again, Pasoleatis.

Simple tests of that type are exactly what I have done. E.g. the specific case I’ve supplied code for contains a single, pure, x frequency component, and no y or z components. I’ve also done even simpler cases… e.g. when the real volume is all ones, has only a DC component. In all cases, I get back expected results using the full C2C transform and incorrect results using C2R. So either I’m doing something incorrectly, or the 3D C2R code in the library is broken.

Regards,

-jh-

I do not think the library is broken. I am have ben using the cufft with D2Z (double to double complex and back) without problem with both 4.0 and 4.1 toolkit in 2D and 3D. I think you are not retrieving the data correct. Just try the real to complex and then complex to real and see if you can get the correct result. I can give my code as well. Try the simplex test with a code written from scratch.

Here are some lines from my code:

const int lx=,ly=lx,lz=lx; //define some lx,ly,lz

   const int totsize=lx*ly*lz,totsize_pad=lx*ly*2*(lz/2+1),totsize_invspa=lx*ly*(lz/2+1); //define total size of the problem

   cufftDoubleReal *dpsi; //gpu pointer for both real and inverse space, inplace transforms

   static cufftDoubleReal hpsi[totsize_pad]; // the matrix on the host

cufftHandle prc,pcr; 

   cufftPlan3d(&prc,lx,ly,lz,CUFFT_D2Z); // make the plans

   cufftPlan3d(&pcr,lx,ly,lz,CUFFT_Z2D);

cudaMalloc((void**)&dpsi, sizeof(cufftDoubleReal)*totsize_pad) //allocate the memory on the gpu

// Initialise the matrix on the host in real space, similar line are used to save the data 

int count=0;

for(int i=0;i<lx;i++)

    { 

    for(int j=0;j<ly;j++)   

    {

    for(int k=0;k<2*(lz/2+1);k++) // k goes beyond lz because of the padding

    {   

    if(k<lz)

    { 

hpsi[count]=(cos(2*Pi*i*dx/abcc); // some initialising depending n i,j,k 

}

    else

    {

    hpsi[count]=0; //these are padded memory not used in real space

    }

    count=count+1;  

    }}} 

    cudaMemcpy(dpsi, hpsi, sizeof(double)*totsize_pad,cudaMemcpyHostToDevice) ; // copy tp device  

// Code to create the laplacian (\nabla^2) in inverse space (-\vec k^2)

    count=0;

    for(int i=0;i<lx;i++)

    {   

    if(i<=lx/2)

    {kx=i*2*Pi/((double)lx*dx);}

    else   

    {kx=(i-lx)*2*Pi/((double)lx*dx);}   

    for(int j=0;j<ly;j++)   

    {   

    if(j<=ly/2)   

    {ky=j*2*Pi/((double)ly*dy);}   

    else   

    {ky=(j-ly)*2*Pi/((double)ly*dy);}

    for(int k=0;k<=lz/2;k++)

    {

    kz=k*2*Pi/((double)lz*dz);

hqq[count]=-(kx*kx+ky*ky+kz*kz)/(kkmax*kkmax);

count=count+1;

    }}}

cudaMemcpy(dqq, hqq, sizeof(double)*totsize_invspa,cudaMemcpyHostToDevice);

cufftExecD2Z(prc,dpsi,(cufftDoubleComplex*)dpsi); // double to double complex transform, inplace

// here do something 

cufftExecZ2D(pcr,(cufftDoubleComplex*)dpsi, dpsi); // double complex to complex tranform, inplace

//here copy the data back to host

cudaMemcpy(hpsi, dpsi, sizeof(double)*totsize_pad,cudaMemcpyDeviceToHost) ;

// Save the data 

count=0;

for(int i=0;i<lx;i++)

    { 

    for(int j=0;j<ly;j++)   

    {

    for(int k=0;k<2*(lz/2+1);k++) // k goes beyond lz because of the padding

    {   

    if(k<lz)

    { 

// print hpsi[count]

}

    else

    {

    // skip the padding

    }

    count=count+1;  

    }}}

I can provide the whole code just to see that if it works, but it requires double precision. It does thousands of transform so it would come out if there is problem

Thanks for your sample code. I tried similar tests, which worked perfectly, and that finally got me looking in the right direction. The real problem was that my calling routine is a Matlab C+±mex file, and I wasn’t correctly accounting for the fact that the Matlab arrays are in column major-rather than row major-order even though they’re passed through a C++ gateway function.

Thanks once again for your help!

Regards,
-jh-