CUFFT appears to give errors for vectors > 1024

The following possible bug was discovered in the CUFFT library.

1D and 2D FFTs seem to produce erroneous results when the data in one of the dimensions has a size larger than 1024. We compared the results with two reference implementations of the FFT:
i) a radix-4 code on an Intel CPU that we have used for many years, implemented in Fortran and tested in single and double precision;
ii) the FFTW library called from Matlab 7.4.

To illustrate the bug, we present the following example.

The input was created in the form of complex arrays, with the input data given by

(1-i), 2*(1-i), …, fftsize*(1-i)

Here i = sqrt(-1), and fftsize is the length of the vector (nx) in 1D or the product of the dimensions (nx*ny) in 2D. (Note the sign of the imaginary part, matching the initialization in the code further down.)
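
As a quick sanity check on the listings below: an unnormalized forward FFT puts the plain sum of all inputs in its first (DC) output entry, i.e. sum over k=1..fftsize of k*(1-i) = fftsize*(fftsize+1)/2 * (1-i). A minimal C snippet computing this for the 1024x1024 case (this check is ours, not part of the test code):

#include <stdio.h>

int main(void)
{
    /* fftsize = nx*ny for the 1024 x 1024 case */
    unsigned long long n = 1024ULL * 1024ULL;

    /* DC term of an unnormalized forward FFT = sum of the inputs.
       For inputs k*(1 - i), k = 1..n, this is n*(n+1)/2 * (1 - i). */
    unsigned long long dc = n * (n + 1) / 2;

    /* Prints 549756338176, matching the 5.497563381760000e+011 and
       -5.497563381760000e+011 first entries reported below. */
    printf("%llu   -%llu\n", dc, dc);
    return 0;
}

All three implementations below agree with this value to their displayed precision. For nx=ny=2048 the same formula gives 8796095119360; FFTW and the radix-4 code match it, while the CUFFT output for that size already differs in the first entry.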

Results for transforms of sizes 1024 or smaller are quite close across the implementations;
e.g., in 2D for size nx=1024, ny=1024 we have the following values for the real and imaginary parts of the output vector (first 10 entries):

CUFFT:

5.497563e+011 -5.497563e+011
1.703664e+008 1.714151e+008
8.492026e+007 8.596888e+007
5.643785e+007 5.748647e+007
4.219638e+007 4.324498e+007
3.365129e+007 3.469987e+007
2.795437e+007 2.900297e+007
2.388499e+007 2.493359e+007
2.083283e+007 2.188141e+007
1.845880e+007 1.950738e+007

FFTW (Matlab):

5.497563381760000e+011 -5.497563381760000e+011i
1.703664947286252e+008 +1.714150707286252e+008i
8.492029911407000e+007 +8.596887511406998e+007i
5.643787646018004e+007 +5.748645246018005e+007i
4.219639704140969e+007 +4.324497304140969e+007i
3.365129491103461e+007 +3.469987091103461e+007i
2.795438141880073e+007 +2.900295741880073e+007i
2.388500428487378e+007 +2.493358028487377e+007i
2.083283736832681e+007 +2.188141336832681e+007i
1.845881058996932e+007 +1.950738658996932e+007i

RADIX4 (Fortran):

0.5497563E+12 -0.5497563E+12
0.1703665E+09 0.1714151E+09
0.8492030E+08 0.8596887E+08
0.5643788E+08 0.5748645E+08
0.4219640E+08 0.4324497E+08
0.3365130E+08 0.3469987E+08
0.2795438E+08 0.2900296E+08
0.2388500E+08 0.2493358E+08
0.2083284E+08 0.2188141E+08
0.1845881E+08 0.1950739E+08

For transforms of size nx=2048, ny=2048 the results are:

CUFFT:

8.794040e+012 -8.798141e+012
1.645764e+009 1.078914e+009
8.195489e+008 5.388700e+008
5.406539e+008 3.543549e+008
4.170834e+008 2.651674e+008
3.244111e+008 2.174269e+008
2.701094e+008 1.783274e+008
2.262236e+008 1.471788e+008
2.284106e+008 1.203969e+008
1.863923e+008 1.344119e+008

FFTW (Matlab):

8.796095119360000e+012 -8.796095119360000e+012i
1.365032326822403e+009 +1.369226630822403e+009i
6.814659789145010e+008 +6.856602829145010e+008i
4.536098147222018e+008 +4.578041187222018e+008i
3.396811964562800e+008 +3.438755004562799e+008i
2.713235965580829e+008 +2.755179005580829e+008i
2.257515058407202e+008 +2.299458098407202e+008i
1.931997060806620e+008 +1.973940100806620e+008i
1.687855881656387e+008 +1.729798921656387e+008i
1.497965914772740e+008 +1.539908954772740e+008i

RADIX4 (Fortran):

0.8796095E+13 -0.8796095E+13
0.1365032E+10 0.1369227E+10
0.6814660E+09 0.6856603E+09
0.4536098E+09 0.4578041E+09
0.3396812E+09 0.3438755E+09
0.2713236E+09 0.2755179E+09
0.2257515E+09 0.2299458E+09
0.1931997E+09 0.1973940E+09
0.1687856E+09 0.1729799E+09
0.1497966E+09 0.1539909E+09

A run over a range of sizes for the 2D FFT is shown below. Observe that CUFFT and the radix-4 single-precision code give similar results until we cross nx=ny=1024 (a sketch of how the relative error can be computed follows the table):

nx=ny   relative error (CUFFT vs RADIX4)

16      0.331E-06
32      0.393E-06
64      0.364E-06
128     0.538E-06
256     0.409E-06
512     0.526E-06
1024    0.534E-06
2048    0.329E+00 (!)
4096    0.658E+00 (!)
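
The post does not state which norm the relative error column uses; a plausible definition is the relative L2 norm of the difference over the full complex output. A minimal sketch under that assumption (the function name and signature are ours, not from the original code):

#include <math.h>

/* Relative L2 error between two complex outputs of length n,
   each stored as interleaved (re, im) float pairs: the L2 norm
   of (a - b) divided by the L2 norm of the reference b. */
float relative_error(const float* a, const float* b, unsigned long n)
{
    double diff = 0.0, ref = 0.0;
    for (unsigned long k = 0; k < 2 * n; ++k) {
        double d = (double)a[k] - (double)b[k];
        diff += d * d;
        ref  += (double)b[k] * (double)b[k];
    }
    return (float)sqrt(diff / ref);
}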

For those who care, here is the CUDA code (for the 1024x1024 case):


CUDA CODE:

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cufft.h>
#include <cutil.h>

// Complex data type
typedef float2 Complex;

// declaration, forward
void runTest(int argc, char** argv);

#define FFT_SIZEX 1024
#define FFT_SIZEY 1024

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
    runTest(argc, argv);

    CUT_EXIT(argc, argv);
}

void runTest(int argc, char** argv)
{
    CUT_CHECK_DEVICE();

    unsigned int fftsize  = FFT_SIZEX * FFT_SIZEY;
    unsigned int mem_size = sizeof(Complex) * fftsize;

    // Allocate host memory for the signal
    Complex* h_fft = (Complex*)malloc(mem_size);

    // Initialize the signal: element i holds (i+1) - (i+1)*sqrt(-1),
    // i.e. the k-th input value (1-based) is k*(1 - i)
    for (unsigned int i = 0; i < fftsize; ++i) {
        h_fft[i].x =  (float)(i) + 1.0f;
        h_fft[i].y = -(float)(i) - 1.0f;
    }

    // Allocate device memory for the signal
    Complex* d_fft;
    CUDA_SAFE_CALL(cudaMalloc((void**)&d_fft, mem_size));

    // Copy host memory to device
    CUDA_SAFE_CALL(cudaMemcpy(d_fft, h_fft, mem_size,
                              cudaMemcpyHostToDevice));

    // CUFFT plan
    cufftHandle plan;
    CUFFT_SAFE_CALL(cufftPlan2d(&plan, FFT_SIZEY, FFT_SIZEX, CUFFT_DATA_C2C));

    // Transform the signal in place
    CUFFT_SAFE_CALL(cufftExecute(plan, d_fft, d_fft, CUFFT_FORWARD));

    // Copy device memory back to host
    CUDA_SAFE_CALL(cudaMemcpy(h_fft, d_fft, mem_size,
                              cudaMemcpyDeviceToHost));

    // Print the first 10 entries of the result
    for (unsigned int i = 0; i < 10; ++i) {
        printf("%e   %e \n", h_fft[i].x, h_fft[i].y);
    }

    // Cleanup
    cufftDestroy(plan);
    free(h_fft);
    CUDA_SAFE_CALL(cudaFree(d_fft));
}
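
(For what it's worth, cufftExecute and CUFFT_DATA_C2C are entry points of the early CUFFT beta used above; in later toolkits the same transform is spelled cufftExecC2C with a CUFFT_C2C plan, and the cutil helper macros are gone. A rough sketch of the equivalent test against the later API, untested here:)

#include <stdlib.h>
#include <stdio.h>
#include <cufft.h>
#include <cuda_runtime.h>

#define FFT_SIZEX 1024
#define FFT_SIZEY 1024

int main(void)
{
    unsigned int fftsize  = FFT_SIZEX * FFT_SIZEY;
    unsigned int mem_size = sizeof(cufftComplex) * fftsize;

    // Same test signal as above: element k (1-based) is k*(1 - i)
    cufftComplex* h_fft = (cufftComplex*)malloc(mem_size);
    for (unsigned int i = 0; i < fftsize; ++i) {
        h_fft[i].x =  (float)(i) + 1.0f;
        h_fft[i].y = -(float)(i) - 1.0f;
    }

    cufftComplex* d_fft;
    cudaMalloc((void**)&d_fft, mem_size);
    cudaMemcpy(d_fft, h_fft, mem_size, cudaMemcpyHostToDevice);

    // Plan and execute an in-place forward C2C transform
    cufftHandle plan;
    cufftPlan2d(&plan, FFT_SIZEY, FFT_SIZEX, CUFFT_C2C);
    cufftExecC2C(plan, d_fft, d_fft, CUFFT_FORWARD);

    cudaMemcpy(h_fft, d_fft, mem_size, cudaMemcpyDeviceToHost);

    for (unsigned int i = 0; i < 10; ++i)
        printf("%e   %e \n", h_fft[i].x, h_fft[i].y);

    cufftDestroy(plan);
    cudaFree(d_fft);
    free(h_fft);
    return 0;
}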

Hi,

We have a bug filed for this, thanks for reporting it. We’ll fix it for the next release.

Mark

Hi Fantalgo,

could you please also post the computation time for the various sizes of the FFT?

Regards,

John

The bug has been fixed, thanks for reporting it.

A fix will be available in the next release.

Here you go, here are the times we see for 2D (N x N) single-precision FFTs (a sketch of one way such timings can be taken follows the table):

N       GPU (s)

16      1.50E-04
32      1.48E-04
64      1.49E-04
128     1.54E-04
256     3.10E-04
512     9.29E-04
1024    3.28E-03
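
The post does not show the timing harness. One way such numbers could be collected is with CUDA events around the transform call; the fragment below would slot into runTest above (whether the quoted times include planning or host-device transfers is not stated, so this is only a guess at the methodology):

// Time just the in-place transform with CUDA events.
cudaEvent_t start, stop;
float ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
CUFFT_SAFE_CALL(cufftExecute(plan, d_fft, d_fft, CUFFT_FORWARD));
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              /* wait for the FFT to finish */

cudaEventElapsedTime(&ms, start, stop);  /* elapsed time in milliseconds */
printf("N = %d: %e s\n", FFT_SIZEX, ms * 1.0e-3);

cudaEventDestroy(start);
cudaEventDestroy(stop);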

mfatica - can you give some idea of the performance delta between the current and next release of CUFFT? Ballpark is OK - roughly similar, somewhat faster, wildly faster?

This bug was related to accuracy.

The library is being optimized all the time, but there are a lot of different code paths in the library, so I would say performance is going to range from roughly similar to somewhat faster, depending on the particular input size.