The following possible bug was discovered in the cufft library.

1d and 2d ffts seems produce erroneous results when the data in one of the dimensions has a size larger than 1024. We compared the results with two versions of the FFT

i) on an Intel CPU (a Radix-4 code we have used for many years implemented in Fortran, tested in single and double precision;

ii) the FFTW library called from Matlab 7.4)

To illustrate the bug, we present the following example

The input was created in the form of complex arrays with input data given by

(1+i), 2*(1+i),…, fftsize*(1+i)

Here i is the sqrt(-1), and fftsize is the length of the vector (nx) in 1d or product of dimensions (nx*ny) in 2d

Results for transforms of sizes 1024 or smaller are quite close between the various implementations,

e.g. in 2d for size nx=1024, ny=1024 we have the following values for real and complex parts of the output vector (first 10 entries):

CUFFT:

5.497563e+011 -5.497563e+011

1.703664e+008 1.714151e+008

8.492026e+007 8.596888e+007

5.643785e+007 5.748647e+007

4.219638e+007 4.324498e+007

3.365129e+007 3.469987e+007

2.795437e+007 2.900297e+007

2.388499e+007 2.493359e+007

2.083283e+007 2.188141e+007

1.845880e+007 1.950738e+007

FFTW (Matlab):

5.497563381760000e+011 -5.497563381760000e+011i

1.703664947286252e+008 +1.714150707286252e+008i

8.492029911407000e+007 +8.596887511406998e+007i

5.643787646018004e+007 +5.748645246018005e+007i

4.219639704140969e+007 +4.324497304140969e+007i

3.365129491103461e+007 +3.469987091103461e+007i

2.795438141880073e+007 +2.900295741880073e+007i

2.388500428487378e+007 +2.493358028487377e+007i

2.083283736832681e+007 +2.188141336832681e+007i

1.845881058996932e+007 +1.950738658996932e+007i

RADIX4 (Fortran):

0.5497563E+12 -0.5497563E+12

0.1703665E+09 0.1714151E+09

0.8492030E+08 0.8596887E+08

0.5643788E+08 0.5748645E+08

0.4219640E+08 0.4324497E+08

0.3365130E+08 0.3469987E+08

0.2795438E+08 0.2900296E+08

0.2388500E+08 0.2493358E+08

0.2083284E+08 0.2188141E+08

0.1845881E+08 0.1950739E+08

For transforms of size nx=2048 ny=2048 the results are:

CUFFT:

8.794040e+012 -8.798141e+012

1.645764e+009 1.078914e+009

8.195489e+008 5.388700e+008

5.406539e+008 3.543549e+008

4.170834e+008 2.651674e+008

3.244111e+008 2.174269e+008

2.701094e+008 1.783274e+008

2.262236e+008 1.471788e+008

2.284106e+008 1.203969e+008

1.863923e+008 1.344119e+008

FFTW (Matlab):

8.796095119360000e+012 -8.796095119360000e+012i

1.365032326822403e+009 +1.369226630822403e+009i

6.814659789145010e+008 +6.856602829145010e+008i

4.536098147222018e+008 +4.578041187222018e+008i

3.396811964562800e+008 +3.438755004562799e+008i

2.713235965580829e+008 +2.755179005580829e+008i

2.257515058407202e+008 +2.299458098407202e+008i

1.931997060806620e+008 +1.973940100806620e+008i

1.687855881656387e+008 +1.729798921656387e+008i

1.497965914772740e+008 +1.539908954772740e+008i

RADIX4 (Fortran):

0.8796095E+13 -0.8796095E+13

0.1365032E+10 0.1369227E+10

0.6814660E+09 0.6856603E+09

0.4536098E+09 0.4578041E+09

0.3396812E+09 0.3438755E+09

0.2713236E+09 0.2755179E+09

0.2257515E+09 0.2299458E+09

0.1931997E+09 0.1973940E+09

0.1687856E+09 0.1729799E+09

0.1497966E+09 0.1539909E+09

A run over a range of sizes for 2d FFT is shown below. Observe that the CUFFT and the radix-4 single precision code give similar results till we cross nx=ny=1024:

nx=ny relative error (CUFFT/RADIX4)

```
16 0.331E-06
32 0.393E-06
64 0.364E-06
```

128 0.538E-06

256 0.409E-06

512 0.526E-06

1024 0.534E-06

2048 0.329E+00 (!)

4096 0.658E+00 (!)

For those that care here is the CUDA code (for 1024x1024)

CUDA CODE:

// includes, system

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

// includes, project

#include <cufft.h>

#include <cutil.h>

// Complex data type

typedef float2 Complex;

// declaration, forward

void runTest(int argc, char** argv);

#define FFT_SIZEX 1024

#define FFT_SIZEY 1024

////////////////////////////////////////////////////////////////////////////////

// Program main

////////////////////////////////////////////////////////////////////////////////

int main(int argc, char** argv)

{

runTest(argc, argv);

```
CUT_EXIT(argc, argv);
```

}

void runTest (int argc, char** argv)

{

CUT_CHECK_DEVICE();

```
unsigned int fftsize=FFT_SIZEX*FFT_SIZEY;
unsigned int mem_size=sizeof(Complex)*fftsize;
// Allocate host memory for the signal
Complex* h_fft = (Complex*)malloc(mem_size);
// Initalize the memory for the signal
for (unsigned int i = 0; i < fftsize; ++i) {
h_fft[i].x = (float)(i)+1.0;
h_fft[i].y = -(float)(i)-1.0;
}
// Allocate device memory for signal
Complex* d_fft;
CUDA_SAFE_CALL(cudaMalloc((void**)&d_fft,mem_size));
// Copy host memory to device
CUDA_SAFE_CALL(cudaMemcpy(d_fft, h_fft, mem_size,
cudaMemcpyHostToDevice));
// CUFFT plan
cufftHandle plan;
CUFFT_SAFE_CALL(cufftPlan2d(&plan, FFT_SIZEY, FFT_SIZEX, CUFFT_DATA_C2C));
// Transform signal
CUFFT_SAFE_CALL(cufftExecute(plan, d_fft, d_fft, CUFFT_FORWARD));
// Copy device memory to host
CUDA_SAFE_CALL(cudaMemcpy(h_fft, d_fft, mem_size,
cudaMemcpyDeviceToHost));
// print the result
for (unsigned int i = 0; i < 10; ++i) {
printf("%e %e \n", h_fft[i].x, h_fft[i].y);
}
// cleanup memory
free(h_fft);
CUDA_SAFE_CALL(cudaFree(d_fft));
```

}