output of CUFFT not centered like FFTW


I am doing a 1D FFT. I have the same input data as would go into FFTW; however, the output from CUFFT does not seem to be “aligned” the same way FFTW's is. That is, in my FFTW code, I could calculate the center of the zero padding, then do some shifting to “left-align” all my data and have trailing zeros.

In CUFFT, the result of the FFT looks like the same data; however, the zeros are not “centered” in the output, so the rest of my algorithm breaks. (After the shift to left-align the data, there is still a “gap” in it.)

Can anyone give me any insight? I thought it had something to do with those compatibility flags, but even with cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_FFTW_ALL); I am still getting a bad result.


Here's a screenshot of the data of the first row. This is a plot of the magnitude of the first row, right after the inverse FFT has been taken. On the left is CUFFT, on the right is FFTW.

One suggestion is to input a signal with a known transform (for example sin or cos) and see where the non-zero values end up.
Could you post your code? Is your input data real or complex?

Hi, thanks for looking.

I did do a simple example like so (edited with new example):

complex<float> *input = (complex<float>*)fftwf_malloc(sizeof(fftwf_complex) * 100);
complex<float> *output = (complex<float>*)fftwf_malloc(sizeof(fftwf_complex) * 100);

fftwf_plan ifft;
ifft = fftwf_plan_dft_1d(100, reinterpret_cast<fftwf_complex*>(input), 

cufftComplex *inplace = (cufftComplex *)malloc(100*sizeof(cufftComplex));
cufftComplex *d_inplace;
cudaMalloc((void **)&d_inplace, 100*sizeof(cufftComplex));

/* same test signal, exp(i*pi/2*n), for both libraries */
for(int i = 0; i < 100; i++)
{
	inplace[i] = make_cuComplex(cos(.5*M_PI*i), sin(.5*M_PI*i));
	input[i] = complex<float>(cos(.5*M_PI*i), sin(.5*M_PI*i));
}

cutilSafeCall(cudaMemcpy(d_inplace, inplace, 100*sizeof(cufftComplex), cudaMemcpyHostToDevice));

cufftHandle plan;
cufftPlan1d(&plan, 100, CUFFT_C2C, 1);
cufftExecC2C(plan, d_inplace, d_inplace, CUFFT_INVERSE);

cutilSafeCall(cudaMemcpy(inplace, d_inplace, 100*sizeof(cufftComplex), cudaMemcpyDeviceToHost));


This gave me a value of 100 for the magnitude of the 76th element when I dumped those to file, and when I tried it with the forward FFT, I got the 100 in the 26th element, with zeros everywhere else. Also, this was exactly the same between CUFFT and FFTW. Now I guess I am even more stumped as to why my other code isn't working.

Is that the original code that you were running?

There is nothing mathematically wrong with the non-zero element coming out at different locations for the forward and inverse transforms; you can see this by writing down the DFT expression for the input signal.


Here is a small code I got by modifying the cufft_library.pdf example. It takes a signal with real part cos(i*2*pi/16) (zero imaginary part) and computes its Fourier transform. The transform has two non-zero points, one at 16 and one at 240. The first half of the output contains the values for positive k, while the second half contains the values for negative k.
Here is the output:

data[15] 0.000056 0.000002
data[16] 128.000000 0.000179
data[17] -0.000069 0.000021
data[239] -0.000069 -0.000021
data[240] 128.000000 -0.000185
data[241] 0.000056 -0.000002

the rest is zero up to single precision.

The k is retrieved this way.

for i <= lx/2: k = i*2*pi/lx
for i > lx/2: k = (i-lx)*2*pi/lx
From my experience this is also the layout of the FFTW library.
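The index rule above can be sketched as a pair of small C helpers. The names bin_to_index and bin_to_k are hypothetical, chosen only for this illustration; lx is the transform length as in the rule.

```c
#include <assert.h>
#include <math.h>

/* Map output bin i (0 <= i < lx) of an lx-point FFT to its signed
   frequency index: bins 0..lx/2 are non-negative, the rest negative. */
int bin_to_index(int i, int lx)
{
    assert(lx > 0 && i >= 0 && i < lx);
    return (i <= lx / 2) ? i : i - lx;
}

/* Angular frequency in radians per sample: k = index * 2*pi / lx. */
double bin_to_k(int i, int lx)
{
    const double two_pi = 2.0 * acos(-1.0);
    return bin_to_index(i, lx) * two_pi / lx;
}
```

With lx = 256, bin 16 maps to index +16 and bin 240 to index -16, which is the pair of peaks expected for a real cosine with a period of 16 samples.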


The code:
#include <stdio.h>
#include <math.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cufft.h>

#define NX 256

int main(int argc, char **argv)
{
    cufftHandle plan;
    cufftComplex *devPtr;
    cufftComplex data[NX];
    int i;

    /* source data creation: real part cos(i*2*pi/16), zero imaginary part */
    for(i = 0 ; i < NX ; i++){
            data[i].x = cos(i*(2*acos(-1.0f)/16));
            data[i].y = 0.0f;
    }

    /* GPU memory allocation */
    cudaMalloc((void**)&devPtr, sizeof(cufftComplex)*NX);

    /* transfer to GPU memory */
    cudaMemcpy(devPtr, data, sizeof(cufftComplex)*NX, cudaMemcpyHostToDevice);

    /* creates 1D FFT plan */
    cufftPlan1d(&plan, NX, CUFFT_C2C, 1);

    /* executes FFT processes (in place) */
    cufftExecC2C(plan, devPtr, devPtr, CUFFT_FORWARD);

    /* transfer results from GPU memory */
    cudaMemcpy(data, devPtr, sizeof(cufftComplex)*NX, cudaMemcpyDeviceToHost);
    for(i = 0 ; i < NX ; i++){
            printf("data[%d] %f %f\n", i, data[i].x, data[i].y);
    }

    /* release the plan and the GPU memory */
    cufftDestroy(plan);
    cudaFree(devPtr);

    return 0;
}


No, that was not the original code. That was just a contrived example to see if the FFTW and CUFFT code had matching output, which they did. For some reason they don't for my actual data though.

Something is wrong: the FFT of real input must always be conjugate-symmetric.

Try a 2048-point FFT with
input: 1 2 3 4 5 … 2047 2048
In the output, the real part must be
2098176, then -1024 (2047 times)