CUFFT appears to give errors for vectors > 1024

The following possible bug was discovered in the CUFFT library.

1D and 2D FFTs seem to produce erroneous results when the data in one of the dimensions has a size larger than 1024. We compared the results with two reference implementations of the FFT:
i) a radix-4 code on an Intel CPU that we have used for many years, implemented in Fortran and tested in single and double precision;
ii) the FFTW library called from Matlab 7.4.

To illustrate the bug, we present the following example.

The input was created in the form of complex arrays, with the input data given by

(1-i), 2*(1-i), …, fftsize*(1-i)

Here i = sqrt(-1), and fftsize is the length of the vector (nx) in 1D or the product of the dimensions (nx*ny) in 2D. (Note the sign of the imaginary part, matching the initialization in the code further down.)
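
As a quick sanity check on the listings below: an unnormalized forward FFT puts the plain sum of all inputs in its first (DC) output entry, i.e. sum over k=1..fftsize of k*(1-i) = fftsize*(fftsize+1)/2 * (1-i). A minimal C snippet computing this for the 1024x1024 case (this check is ours, not part of the test code):

#include <stdio.h>

int main(void)
{
    /* fftsize = nx*ny for the 1024 x 1024 case */
    unsigned long long n = 1024ULL * 1024ULL;

    /* DC term of an unnormalized forward FFT = sum of the inputs.
       For inputs k*(1 - i), k = 1..n, this is n*(n+1)/2 * (1 - i). */
    unsigned long long dc = n * (n + 1) / 2;

    /* Prints 549756338176, matching the 5.497563381760000e+011 and
       -5.497563381760000e+011 first entries reported below. */
    printf("%llu   -%llu\n", dc, dc);
    return 0;
}

All three implementations below agree with this value to their displayed precision. For nx=ny=2048 the same formula gives 8796095119360; FFTW and the radix-4 code match it, while the CUFFT output for that size already differs in the first entry.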

Results for transforms of sizes 1024 or smaller are quite close across the implementations;
e.g., in 2D for size nx=1024, ny=1024 we have the following values for the real and imaginary parts of the output vector (first 10 entries):

CUFFT:

5.497563e+011 -5.497563e+011
1.703664e+008 1.714151e+008
8.492026e+007 8.596888e+007
5.643785e+007 5.748647e+007
4.219638e+007 4.324498e+007
3.365129e+007 3.469987e+007
2.795437e+007 2.900297e+007
2.388499e+007 2.493359e+007
2.083283e+007 2.188141e+007
1.845880e+007 1.950738e+007

FFTW (Matlab):

5.497563381760000e+011 -5.497563381760000e+011i
1.703664947286252e+008 +1.714150707286252e+008i
8.492029911407000e+007 +8.596887511406998e+007i
5.643787646018004e+007 +5.748645246018005e+007i
4.219639704140969e+007 +4.324497304140969e+007i
3.365129491103461e+007 +3.469987091103461e+007i
2.795438141880073e+007 +2.900295741880073e+007i
2.388500428487378e+007 +2.493358028487377e+007i
2.083283736832681e+007 +2.188141336832681e+007i
1.845881058996932e+007 +1.950738658996932e+007i

RADIX4 (Fortran):

0.5497563E+12 -0.5497563E+12
0.1703665E+09 0.1714151E+09
0.8492030E+08 0.8596887E+08
0.5643788E+08 0.5748645E+08
0.4219640E+08 0.4324497E+08
0.3365130E+08 0.3469987E+08
0.2795438E+08 0.2900296E+08
0.2388500E+08 0.2493358E+08
0.2083284E+08 0.2188141E+08
0.1845881E+08 0.1950739E+08

For transforms of size nx=2048, ny=2048 the results are:

CUFFT:

8.794040e+012 -8.798141e+012
1.645764e+009 1.078914e+009
8.195489e+008 5.388700e+008
5.406539e+008 3.543549e+008
4.170834e+008 2.651674e+008
3.244111e+008 2.174269e+008
2.701094e+008 1.783274e+008
2.262236e+008 1.471788e+008
2.284106e+008 1.203969e+008
1.863923e+008 1.344119e+008

FFTW (Matlab):

8.796095119360000e+012 -8.796095119360000e+012i
1.365032326822403e+009 +1.369226630822403e+009i
6.814659789145010e+008 +6.856602829145010e+008i
4.536098147222018e+008 +4.578041187222018e+008i
3.396811964562800e+008 +3.438755004562799e+008i
2.713235965580829e+008 +2.755179005580829e+008i
2.257515058407202e+008 +2.299458098407202e+008i
1.931997060806620e+008 +1.973940100806620e+008i
1.687855881656387e+008 +1.729798921656387e+008i
1.497965914772740e+008 +1.539908954772740e+008i

RADIX4 (Fortran):

0.8796095E+13 -0.8796095E+13
0.1365032E+10 0.1369227E+10
0.6814660E+09 0.6856603E+09
0.4536098E+09 0.4578041E+09
0.3396812E+09 0.3438755E+09
0.2713236E+09 0.2755179E+09
0.2257515E+09 0.2299458E+09
0.1931997E+09 0.1973940E+09
0.1687856E+09 0.1729799E+09
0.1497966E+09 0.1539909E+09

A run over a range of sizes for the 2D FFT is shown below. Observe that CUFFT and the radix-4 single-precision code give similar results until we cross nx=ny=1024 (a sketch of how the relative error can be computed follows the table):

nx=ny   relative error (CUFFT vs RADIX4)

16      0.331E-06
32      0.393E-06
64      0.364E-06
128     0.538E-06
256     0.409E-06
512     0.526E-06
1024    0.534E-06
2048    0.329E+00 (!)
4096    0.658E+00 (!)
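
The post does not state which norm the relative error column uses; a plausible definition is the relative L2 norm of the difference over the full complex output. A minimal sketch under that assumption (the function name and signature are ours, not from the original code):

#include <math.h>

/* Relative L2 error between two complex outputs of length n,
   each stored as interleaved (re, im) float pairs: the L2 norm
   of (a - b) divided by the L2 norm of the reference b. */
float relative_error(const float* a, const float* b, unsigned long n)
{
    double diff = 0.0, ref = 0.0;
    for (unsigned long k = 0; k < 2 * n; ++k) {
        double d = (double)a[k] - (double)b[k];
        diff += d * d;
        ref  += (double)b[k] * (double)b[k];
    }
    return (float)sqrt(diff / ref);
}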

For those who care, here is the CUDA code (for the 1024x1024 case):


CUDA CODE:

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cufft.h>
#include <cutil.h>

// Complex data type
typedef float2 Complex;

// declaration, forward
void runTest(int argc, char** argv);

#define FFT_SIZEX 1024
#define FFT_SIZEY 1024

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
    runTest(argc, argv);

    CUT_EXIT(argc, argv);
}

void runTest(int argc, char** argv)
{
    CUT_CHECK_DEVICE();

    unsigned int fftsize  = FFT_SIZEX * FFT_SIZEY;
    unsigned int mem_size = sizeof(Complex) * fftsize;

    // Allocate host memory for the signal
    Complex* h_fft = (Complex*)malloc(mem_size);

    // Initialize the signal: element i holds (i+1) - (i+1)*sqrt(-1),
    // i.e. the k-th input value (1-based) is k*(1 - i)
    for (unsigned int i = 0; i < fftsize; ++i) {
        h_fft[i].x =  (float)(i) + 1.0f;
        h_fft[i].y = -(float)(i) - 1.0f;
    }

    // Allocate device memory for the signal
    Complex* d_fft;
    CUDA_SAFE_CALL(cudaMalloc((void**)&d_fft, mem_size));

    // Copy host memory to device
    CUDA_SAFE_CALL(cudaMemcpy(d_fft, h_fft, mem_size,
                              cudaMemcpyHostToDevice));

    // CUFFT plan
    cufftHandle plan;
    CUFFT_SAFE_CALL(cufftPlan2d(&plan, FFT_SIZEY, FFT_SIZEX, CUFFT_DATA_C2C));

    // Transform the signal in place
    CUFFT_SAFE_CALL(cufftExecute(plan, d_fft, d_fft, CUFFT_FORWARD));

    // Copy device memory back to host
    CUDA_SAFE_CALL(cudaMemcpy(h_fft, d_fft, mem_size,
                              cudaMemcpyDeviceToHost));

    // Print the first 10 entries of the result
    for (unsigned int i = 0; i < 10; ++i) {
        printf("%e   %e \n", h_fft[i].x, h_fft[i].y);
    }

    // Cleanup
    cufftDestroy(plan);
    free(h_fft);
    CUDA_SAFE_CALL(cudaFree(d_fft));
}
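
(For what it's worth, cufftExecute and CUFFT_DATA_C2C are entry points of the early CUFFT beta used above; in later toolkits the same transform is spelled cufftExecC2C with a CUFFT_C2C plan, and the cutil helper macros are gone. A rough sketch of the equivalent test against the later API, untested here:)

#include <stdlib.h>
#include <stdio.h>
#include <cufft.h>
#include <cuda_runtime.h>

#define FFT_SIZEX 1024
#define FFT_SIZEY 1024

int main(void)
{
    unsigned int fftsize  = FFT_SIZEX * FFT_SIZEY;
    unsigned int mem_size = sizeof(cufftComplex) * fftsize;

    // Same test signal as above: element k (1-based) is k*(1 - i)
    cufftComplex* h_fft = (cufftComplex*)malloc(mem_size);
    for (unsigned int i = 0; i < fftsize; ++i) {
        h_fft[i].x =  (float)(i) + 1.0f;
        h_fft[i].y = -(float)(i) - 1.0f;
    }

    cufftComplex* d_fft;
    cudaMalloc((void**)&d_fft, mem_size);
    cudaMemcpy(d_fft, h_fft, mem_size, cudaMemcpyHostToDevice);

    // Plan and execute an in-place forward C2C transform
    cufftHandle plan;
    cufftPlan2d(&plan, FFT_SIZEY, FFT_SIZEX, CUFFT_C2C);
    cufftExecC2C(plan, d_fft, d_fft, CUFFT_FORWARD);

    cudaMemcpy(h_fft, d_fft, mem_size, cudaMemcpyDeviceToHost);

    for (unsigned int i = 0; i < 10; ++i)
        printf("%e   %e \n", h_fft[i].x, h_fft[i].y);

    cufftDestroy(plan);
    cudaFree(d_fft);
    free(h_fft);
    return 0;
}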

Hi,

We have a bug filed for this, thanks for reporting it. We’ll fix it for the next release.

Mark

Hi Fantalgo,

could you please also post the computation time for the various sizes of the FFT?

Regards,

John

The bug has been fixed, thanks for reporting it.

A fix will be available in the next release.

Here you go, here are the times we see for 2D (N x N) single-precision FFTs (a sketch of one way such timings can be taken follows the table):

N       GPU (s)

16      1.50E-04
32      1.48E-04
64      1.49E-04
128     1.54E-04
256     3.10E-04
512     9.29E-04
1024    3.28E-03
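
The post does not show the timing harness. One way such numbers could be collected is with CUDA events around the transform call; the fragment below would slot into runTest above (whether the quoted times include planning or host-device transfers is not stated, so this is only a guess at the methodology):

// Time just the in-place transform with CUDA events.
cudaEvent_t start, stop;
float ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
CUFFT_SAFE_CALL(cufftExecute(plan, d_fft, d_fft, CUFFT_FORWARD));
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              /* wait for the FFT to finish */

cudaEventElapsedTime(&ms, start, stop);  /* elapsed time in milliseconds */
printf("N = %d: %e s\n", FFT_SIZEX, ms * 1.0e-3);

cudaEventDestroy(start);
cudaEventDestroy(stop);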

mfatica - can you give some idea of the performance delta between the current and next release of CUFFT? Ballpark is OK - roughly similar, somewhat faster, wildly faster?

This bug was related to accuracy.

The library is being optimized all the time, but there are a lot of different code paths in the library, so I would say performance is going to range from roughly similar to somewhat faster, depending on the particular input size.