error on 1D FFT batch mode under in-place transformation

Dear all:

I try to do 1D FFT R2C (real to complex) in-place (batch mode), but

result is wrong if batch > 1.

The following code do 1D R2C with batch = 2

[codebox]#include <stdio.h>

#include <assert.h>

#include <cufft.h>

#include <cutil_inline.h>

#ifdef DO_DOUBLE

typedef double  doublereal ;

typedef cufftDoubleComplex Complex; 

#else

typedef float   doublereal ;

typedef cufftComplex  Complex; 

#endif

int main( int argc, char* argv )

{

int i, j ;

int batch = 2 ; 

int n = 5 ;

doublereal *u = (doublereal *)malloc( sizeof(doublereal)*batch*n ) ;

assert( u ) ;

for(i = 0 ; i < batch*n ; i++){ u[i] = 0.0 ; }

for(j = 0 ; j < n ; j++){ u[j] = 1.0 ; }	

doublereal *ptr = u ;

for(i = 1 ; i <= batch ; i++){

	for( j = 0; j < n ; j++){

		printf("u(%d,%d) = %13.7E\n", i, j, *ptr );

		ptr++ ;

	}

}	

cufftHandle  plan ; 

#if defined (DO_DOUBLE)

cufftPlan1d(&plan, n, CUFFT_D2Z, batch );

#else

cufftPlan1d(&plan, n, CUFFT_R2C, batch );

#endif

Complex  *d_u ;

cutilSafeCall( cudaMalloc((void**)&d_u, sizeof(Complex)*batch*((n>>1)+1) ) );

CUDA_SAFE_CALL(cudaMemcpy( d_u, u, sizeof(doublereal)*batch*n , cudaMemcpyHostToDevice) );

#if defined (DO_DOUBLE)

cufftExecD2Z( plan, (cufftDoubleReal *)d_u, d_u );

#else

cufftExecR2C( plan,       (cufftReal *)d_u, d_u );

#endif

Complex *u_hat = (Complex *)malloc(sizeof(Complex)*batch*((n>>1)+1) );

CUDA_SAFE_CALL(cudaMemcpy( u_hat, d_u, sizeof(Complex)*batch*((n>>1)+1), cudaMemcpyDeviceToHost) );

Complex *qptr = u_hat ;

for(i = 1 ; i <= batch ; i++){

	for( j = 0; j <= (n>>1) ; j++){

		printf("cufft(u)(%d,%d) = (%13.7E, %13.7E)\n", i, j, qptr->x, qptr->y );

		qptr++ ;

	}

}		 

}

[/codebox]

the output is

u(1,0) = 1.0000000E+000

u(1,1) = 1.0000000E+000

u(1,2) = 1.0000000E+000

u(1,3) = 1.0000000E+000

u(1,4) = 1.0000000E+000

u(2,0) = 0.0000000E+000

u(2,1) = 0.0000000E+000

u(2,2) = 0.0000000E+000

u(2,3) = 0.0000000E+000

u(2,4) = 0.0000000E+000

cufft(u)(1,0) = (5.0000000E+000, 0.0000000E+000)

cufft(u)(1,1) = (0.0000000E+000, 0.0000000E+000)

cufft(u)(1,2) = (0.0000000E+000, 0.0000000E+000)

cufft(u)(2,0) = (8.0901700E-001, 0.0000000E+000)

cufft(u)(2,1) = (2.4999999E-001, 7.6942128E-001)

cufft(u)(2,2) = (-6.5450847E-001, 4.7552806E-001)

  1. first sequence u(1,0:4) has correct result

    cufft(u(1,0:4)) = [5 , 0, 0 ]

  2. second sequence u(2,0:4) is a zero vector,

    but its Fourier coefficent is not zero.

However if I use out-of-place, then it is O.K.

I think that this problem is related to my previous post

http://forums.nvidia.com/index.php?showtop…hl=in-place+FFT

since 2D FFT can be obtained by do two 1D FFT.

Does anyone have successful experience on “in-place R2C, batch mode” ?

ps: my platform is winxp pro64, vc2005, cuda 2.3, driver 190.38, GTX295

I found the problem, in CUFFT_Library_2.3.pdf, stride of batch mode is described as

"For in-place FFTs, the input stride is assumed to

be 2*(N/2+1) cufftReal elements or N/2+1 cufftComplex elements.

For out-of-place transforms, the input and output strides match the

logical transform size (N) and the non?redundant size (N/2+1),

respectively"

this means that we cannot put all sequence of input data into a contiguous block.

for n = 5, stride = 2*(n/2 + 1) = 6, suppose batch = 2 and input data is “u”, then

we put first sequence into

u[0] u[1] u[2] u[3] u[4]

and put second sequence into

u[6] u[7] u[8] u[9] u[10]

in my previous post, I put second sequence into

u[5] u[6] u[7] u[8] u[9]

this is wrong.

Also one need to copy host data into device memory sequence by sequence, part of code looks like

[codebox] int batch = 2 ;

size_t  n = 5 ;	

size_t  n_complex = (n>>1)+1 ;



....



cutilSafeCall( cudaMalloc((void**)&d_u, sizeof(Complex)*batch*n_complex ) );	

// copy host data to device memory sequence-by-sequence

CUDA_SAFE_CALL(cudaMemcpy( d_u, u, sizeof(doublereal)*n , cudaMemcpyHostToDevice) );

CUDA_SAFE_CALL(cudaMemcpy( d_u + n_complex, u + stride, sizeof(doublereal)*n , cudaMemcpyHostToDevice) );

#if defined (DO_DOUBLE)

cufftExecD2Z( plan, (cufftDoubleReal *)d_u, d_u );

else

cufftExecR2C( plan,       (cufftReal *)d_u, d_u );

endif

u_hat = (Complex *)malloc(sizeof(Complex)*batch*n_complex );	

CUDA_SAFE_CALL(cudaMemcpy( u_hat, d_u, sizeof(Complex)*batch*n_complex, cudaMemcpyDeviceToHost) );[/codebox]