Bad output from 3D cufft C2R transform?

John_H · March 1, 2012, 11:44pm

I’m having a problem with 3D inverse dft’s using the C2R transform. If I use the full C2C transform and take only the real values, I get the expected results. However, if I use packed (n1 x n2 x (n3 / 2 + 1)) complex input and the C2R transform, I get incorrect results and there are some suggestions that out of bounds memory is being accessed. (No segmentation violations, but other somewhat random behavior). Specifically, I’m calling an inverse transform with input which is the complex transform of a 4x4x4 volume with each row of each plane containing [0 .707 1 .707]. I’ve verified that the packed, interleaved input appears correct, at least according to my understanding of the cufft and fftw documentation. With the C2C version, the test1 program below produces exactly the expected result:

Output:
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7

0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7
0.0 0.7 1.0 0.7

but with the C2R version I get:

Output:
-0.2 0.5 0.8 0.5
0.0 0.7 1.0 0.7
0.2 1.0 1.2 1.0
0.0 0.7 1.0 0.7

-0.2 0.5 0.8 0.5
0.0 0.7 1.0 0.7
0.2 1.0 1.2 1.0
0.0 0.7 1.0 0.7

Notice that the even numbered rows are correct, but the odd numbered ones are not. The same results are obtained using either in-place or out-of-place transforms. (Code supplied below is for the in-place case).

This is using cuda 4.1 on RHEL6-64 6.2(2.6.32-220.4.2.el6.x86_64) with gcc 4.3.4 or 4.4.5 (both behave identically).
The system is an HP Pavillion with 12 GB of ram, 3.2 GHz 6-core I7 (970) processor, and a GT420. (Yes, I know that’s an underpowered card. I just haven’t gotten around to installing a better one yet.) The driver version is 295.20, although for various reasons our IT folks insist on installing via dkms rather running the Nvidia installer. So far, I haven’t seen any issues caused by this latter fact. I’m using a 3D grid, so I always compile with arch=compute_20 code=sm_20.

I’ve attached files to reproduce both of these cases. File test1.cpp is the common driver, while gpufft3C2R.cu and gpufft3C2C.cu are the bad and good cuda sources, respectively. I’d appreciate it greatly if anyone can point out whether I’ve made some error in the way I’m calling the C2R transform, or whether this suggests a bug in either the cufft library or the GT420 driver.

Thanks in advance!
gpufft3C2C.cu (4.26 KB)
gpufft3C2R.cu (4.56 KB)
test1.cpp (1.17 KB)

pasoleatis · March 2, 2012, 9:41am

If there is memory problem, cuda-memecheck will find it. If you do an inplace transform you need to get extra care about how you define the real array since it must have the same size, but the padding is not continuous.

John_H · March 2, 2012, 7:42pm

Thank you for your reply and suggestion.

No problems are reported by cuda-memcheck. I think I’ve handled the padding for the in-place case correctly, but it’s always possible that I’ve misinterpreted the documentation. In any event, both the in-place and out-of-place transforms return identical but incorrect results.

Regards,

-jh-

pasoleatis · March 3, 2012, 9:51am

When I started to use cufft I did simple tests.

make forward and back transforms
put a periodic function like sin or cos and make the forward transform. in this case the result is 0 expect for a few points. If the input is 2cos((2pi/8) *i) the forward transform should have 1 point equal to 1 at 8.
take a matrix which is 0 every one expect 1 point where it is equal to 1. Now take the forward transform and multiply with 2*(1-cos(kx), where kx is i2pi/Nx (Nx is the size in the x direction), then take the back transform. This is the equivalent of a discrete derivative. If the initial matrix is on some line

0 0 1 0 0 after the procedure you should get 0 1 -2 1 0

These are simple tests to figure out if the data is retrieved correctly in k space.

John_H · March 3, 2012, 8:32pm

Thanks once again, Pasoleatis.

Simple tests of that type are exactly what I have done. E.g. the specific case I’ve supplied code for contains a single, pure, x frequency component, and no y or z components. I’ve also done even simpler cases… e.g. when the real volume is all ones, has only a DC component. In all cases, I get back expected results using the full C2C transform and incorrect results using C2R. So either I’m doing something incorrectly, or the 3D C2R code in the library is broken.

Regards,

-jh-

pasoleatis · March 3, 2012, 8:46pm

I do not think the library is broken. I am have ben using the cufft with D2Z (double to double complex and back) without problem with both 4.0 and 4.1 toolkit in 2D and 3D. I think you are not retrieving the data correct. Just try the real to complex and then complex to real and see if you can get the correct result. I can give my code as well. Try the simplex test with a code written from scratch.

Here are some lines from my code:

const int lx=,ly=lx,lz=lx; //define some lx,ly,lz

   const int totsize=lx*ly*lz,totsize_pad=lx*ly*2*(lz/2+1),totsize_invspa=lx*ly*(lz/2+1); //define total size of the problem

   cufftDoubleReal *dpsi; //gpu pointer for both real and inverse space, inplace transforms

   static cufftDoubleReal hpsi[totsize_pad]; // the matrix on the host

cufftHandle prc,pcr; 

   cufftPlan3d(&prc,lx,ly,lz,CUFFT_D2Z); // make the plans

   cufftPlan3d(&pcr,lx,ly,lz,CUFFT_Z2D);

cudaMalloc((void**)&dpsi, sizeof(cufftDoubleReal)*totsize_pad) //allocate the memory on the gpu

// Initialise the matrix on the host in real space, similar line are used to save the data 

int count=0;

for(int i=0;i<lx;i++)

    { 

    for(int j=0;j<ly;j++)   

    {

    for(int k=0;k<2*(lz/2+1);k++) // k goes beyond lz because of the padding

    {   

    if(k<lz)

    { 

hpsi[count]=(cos(2*Pi*i*dx/abcc); // some initialising depending n i,j,k 

}

    else

    {

    hpsi[count]=0; //these are padded memory not used in real space

    }

    count=count+1;  

    }}} 

    cudaMemcpy(dpsi, hpsi, sizeof(double)*totsize_pad,cudaMemcpyHostToDevice) ; // copy tp device  

// Code to create the laplacian (\nabla^2) in inverse space (-\vec k^2)

    count=0;

    for(int i=0;i<lx;i++)

    {   

    if(i<=lx/2)

    {kx=i*2*Pi/((double)lx*dx);}

    else   

    {kx=(i-lx)*2*Pi/((double)lx*dx);}   

    for(int j=0;j<ly;j++)   

    {   

    if(j<=ly/2)   

    {ky=j*2*Pi/((double)ly*dy);}   

    else   

    {ky=(j-ly)*2*Pi/((double)ly*dy);}

    for(int k=0;k<=lz/2;k++)

    {

    kz=k*2*Pi/((double)lz*dz);

hqq[count]=-(kx*kx+ky*ky+kz*kz)/(kkmax*kkmax);

count=count+1;

    }}}

cudaMemcpy(dqq, hqq, sizeof(double)*totsize_invspa,cudaMemcpyHostToDevice);

cufftExecD2Z(prc,dpsi,(cufftDoubleComplex*)dpsi); // double to double complex transform, inplace

// here do something 

cufftExecZ2D(pcr,(cufftDoubleComplex*)dpsi, dpsi); // double complex to complex tranform, inplace

//here copy the data back to host

cudaMemcpy(hpsi, dpsi, sizeof(double)*totsize_pad,cudaMemcpyDeviceToHost) ;

// Save the data 

count=0;

for(int i=0;i<lx;i++)

    { 

    for(int j=0;j<ly;j++)   

    {

    for(int k=0;k<2*(lz/2+1);k++) // k goes beyond lz because of the padding

    {   

    if(k<lz)

    { 

// print hpsi[count]

}

    else

    {

    // skip the padding

    }

    count=count+1;  

    }}}

I can provide the whole code just to see that if it works, but it requires double precision. It does thousands of transform so it would come out if there is problem

John_H · March 6, 2012, 4:15pm

Thanks for your sample code. I tried similar tests, which worked perfectly, and that finally got me looking in the right direction. The real problem was that my calling routine is a Matlab C+±mex file, and I wasn’t correctly accounting for the fact that the Matlab arrays are in column major-rather than row major-order even though they’re passed through a C++ gateway function.

Thanks once again for your help!

Regards,
-jh-

Topic		Replies	Views
Cufft_R2C and Cufft_C2R are inaccurate GPU-Accelerated Libraries	2	1745	April 11, 2014
2D CUFFT wrong result GPU-Accelerated Libraries cufft	8	3127	November 7, 2023
cuFFT cufftPlan1d and cufftExecR2C issues GPU-Accelerated Libraries	4	2401	July 13, 2016
Cufft 1D transform CUDA Programming and Performance	6	3024	January 8, 2015
CUFFT out-of-place transform destroys input? GPU-Accelerated Libraries cufft	4	2261	July 28, 2021
cuFFT and fftw CUDA Programming and Performance	10	4210	August 25, 2010
in-place FFT fails on 2D real transform CUDA Programming and Performance	0	833	August 13, 2009
cuFFT 1d inverse transform unexpected results GPU-Accelerated Libraries cufft	5	993	February 27, 2023
Wrong results produced by Cufft GPU-Accelerated Libraries	8	1115	October 4, 2018
cufft functionality CUDA Programming and Performance	7	1222	January 12, 2012

Bad output from 3D cufft C2R transform?

Related topics