Problem with CUFFT

I’m trying to get a 2D FFT out of CUFFT, but it doesn’t seem to be working. The code I’m working with is below. If I comment out the two cufftExecute() lines, then the image will come back as it went in. If they run, however, then I get back a screen of noise with what looks vaguely like the original image smeared horizontally the whole way across. The original image (the input to cufftExecute()) also gets corrupted, and the transformed data looks like noise.

Note that the image is 1024 x 650, monochrome, and stored in an array of floats. Pixels are horizontally-sequential (incrementing the index moves right a pixel).

Can anyone tell me what I’m doing wrong here?

#include "/usr/local/cuda/include/cufft.h"

#include <math.h>

cufftComplex *d_ipt1;

cufftComplex *d_ipt1_fft;

cufftComplex *d_ipt2;

cufftComplex *d_ipt2_fft;

cufftComplex *d_output;

cufftComplex *d_output_fft;

cufftHandle plan;

int mem_size = sizeof(cufftComplex) * 665600;

/* ========================== */

extern "C" {

int IMCON_init() {

cufftResult myResult;

// Allocate device memory for input / output 

cudaMalloc((void**)&d_ipt1, mem_size);

cudaMalloc((void**)&d_ipt2, mem_size);

cudaMalloc((void**)&d_ipt1_fft, mem_size);

cudaMalloc((void**)&d_ipt2_fft, mem_size);

cudaMalloc((void**)&d_output, mem_size);

cudaMalloc((void**)&d_output_fft, mem_size);

// Create CUFFTplan object                   .

myResult = cufftPlan2d(&plan, 1024, 650, CUFFT_DATA_C2C);

if( myResult == CUFFT_SUCCESS ) { return 0; }

else { return 1; }

}

/*   =====================       */

void IMCON_quit() {

cufftDestroy( plan );

cudaFree( d_ipt1 );

cudaFree( d_ipt2 );

cudaFree( d_ipt1_fft );

cudaFree( d_ipt2_fft );

cudaFree( d_output );

cudaFree( d_output_fft );

}

/*      ========================         */

int screenFFT( float* input ) {

cufftResult myResult1;

cufftResult myResult2;

cufftComplex h_ipt1[ 665600 ];

for( int i = 0; i < 665600; i++ ) {

	h_ipt1[ i ][ 0 ] = input[ i ];

	h_ipt1[ i ][ 1 ] = 0;

	}

   // Copy host memory to device

    cudaMemcpy(d_ipt1, h_ipt1, mem_size, cudaMemcpyHostToDevice);

myResult1 = cufftExecute(plan, d_ipt1, d_ipt1_fft, CUFFT_FORWARD);

myResult2 = cufftExecute(plan, d_ipt1_fft, d_ipt1, CUFFT_INVERSE);

if( myResult1 != CUFFT_SUCCESS or myResult2 != CUFFT_SUCCESS ) { return 1; }

else{ return 0; }

}

/*     ==============================          */

void retrieveFFTmags( float* output ) {

cufftComplex h_opt1[ 665600 ];

cudaMemcpy(h_opt1, d_ipt1, mem_size, cudaMemcpyDeviceToHost);

for( int i = 0; i < 665600; i++ ) {

	output[ i ] = h_opt1[ i ][ 0 ];

	}

}

}

CUFFT returns non-normalized data, so you should divide the output by the number of input elements. Probably that helps.

for( int i = 0; i < 665600; i++ )

{

    output[ i ] = h_opt1[ i ][ 0 ] / 665600;

}

I’ve gotten to the point where I’m getting reasonable-looking transforms back (I had to do the normalization, and the dimensions in the plan were reversed).

I do have two (probably really stupid) questions, though:

Why is the DC component showing up at the four corners, with the higher frequencies near the center? This looks like the exact opposite of the image FFTs I’ve seen before, like the ones shown on this page:

[url=“Introduction to the Fourier Transform”]http://www.cs.unm.edu/~brayer/vision/fourier.html[/url]

Also, if I’m doing a 2D convolution, where should the division by N (the number of pixels) be? Should it be done to each FFT, or just once to the output (in this case, I’m assuming it would be N^2)?

[edit]

My convolutions are coming back with the same issue as the FFT >.<

The four quadrants of the output are out of order (upper left → lower right, etc.), but they aren’t flipped themselves. Does anyone know why this is happening?

Right, if you store images in “usual” layout: pixel(x, y) = data[y], then the dimensions in cufftPlan2d() should be reversed:

myResult = cufftPlan2d(&plan, 1024, 650, CUFFT_DATA_C2C); ==>

myResult = cufftPlan2d(&plan, 650, 1024, CUFFT_DATA_C2C);

first and second dimensions correspond not to X-Y dimensions in notation, common to graphics, but to the first and the second dimension of data[DIM_1][DIM_2] array, (corresponding to DIM_2 x DIM_1 image).

Data normalization depends on where you want it to be. Usually it’s done somewhere after direct FFTs and before final inverse FFT. For this case you just need to normalize once by the data element count.

As for “four quadrants” issue, spectre of discretized signal infinitely repeats with the period of discretization frequency. FFT samples X(n) (n = [0 … N - 1]) correspond to [0…1) domain of frequency values, normalized by the discretization frequency, not [-1/2, 1/2). Moreover, for real signal X(n) == X*(N - n).

In order to get “normal” form you just need to shift cyclically you spectre in both dimensions.

Next CUDA SDK release will contain FFT 2D convolution sample.

Hey, thanks for the help!

The dimensions thing makes sense when you describe it as pixel[ y ][ x ]. It might be good to mention this in the manual, though, as it’s fairly ambiguous on this point…

The four-quadrant problem was actually a really dumb mistake on my part–I had the kernel placed at the center of the array External Image . Remembering that FFT convolutions are circular, I moved it to 0, 0 and made it wrap to the other 3 corners, and that fixed it.

You’re welcome. :)

CUFFT API is modelled after FFTW – this is the only reason of such ordering. I agree, for people with graphics background and zero FFTW experience “nx - transform size in the X dimension \ ny - the transform size in the Y dimension” is very confusing.

You should now post your correct code for others to learn…

Iam new in cuda ,i have a vector such as vector=[1 2 3], i want to repeat each element in array such as
matrix=[1 1 1 2 2 2 3 3 3] , i want to repeat each element three time the kernel i have wrote repeat as the following matrix=[ 1 2 3 1 2 3 1 2 3]
global void repeat_kernel(int* matrix, int* dev_vector)
{
int i = threadIdx.x + blockIdx.xblockDim.x;
int j = threadIdx.y + blockIdx.y
blockDim.y;
int tid = j*num_col + i;

if (tid < row_num*num_col) {
	matrix[tid] = dev_vector[i%row_num];


}

}
can any one help me to modify kernel and repeat as the following [1 1 1 2 2 2 3 3 3]