accuracy of CUFFT under double precision

Dear all:
I found that CUFFT (in double precision) is only accurate when the size of the FFT is a power of 2.

For example, consider N = 7
input data: x[j] = cos( 2*pi*j/N ), j = 0, 1, 2, ..., 6

x[0] = ( 1.000000000000000E+000, 0.000000E+000)
x[1] = ( 6.234898018587336E-001, 0.000000E+000)
x[2] = ( -2.225209339563143E-001, 0.000000E+000)
x[3] = ( -9.009688679024190E-001, 0.000000E+000)
x[4] = ( -9.009688679024192E-001, 0.000000E+000)
x[5] = ( -2.225209339563146E-001, 0.000000E+000)
x[6] = ( 6.234898018587334E-001, 0.000000E+000)

after calling
cufftExecZ2Z( plan, x, y, CUFFT_FORWARD )

we have

y[0] = (-2.220446049250313E-016, 0.000000E+000)
y[1] = (3.500000000371433E+000, -4.653094E-016)
y[2] = (-2.656005562790720E-010, -2.129252E-017)
y[3] = (-1.058317940491451E-010, -9.278640E-018)
y[4] = (-1.058317940491451E-010, 9.278640E-018)
y[5] = (-2.656005562790720E-010, 2.129252E-017)
y[6] = (3.500000000371433E+000, 4.653094E-016)

but the exact result of the forward FFT is

y[1] = 3.5
y[6] = 3.5
y[j] = 0, otherwise

Hence CUFFT achieves only about 10 digits of accuracy in this case.
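For reference, here is a minimal sketch of the test I ran (standard cuFFT API; error checking omitted for brevity):

[codebox]#include <cstdio>
#include <cmath>
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int N = 7;
    const double pi = acos(-1.0);
    cufftDoubleComplex h_x[N], h_y[N];
    for (int j = 0; j < N; ++j) {
        h_x[j].x = cos(2.0 * pi * j / N);  // real part: cos(2*pi*j/N)
        h_x[j].y = 0.0;                    // imaginary part: 0
    }

    cufftDoubleComplex *d_x, *d_y;
    cudaMalloc((void**)&d_x, N * sizeof(cufftDoubleComplex));
    cudaMalloc((void**)&d_y, N * sizeof(cufftDoubleComplex));
    cudaMemcpy(d_x, h_x, N * sizeof(cufftDoubleComplex), cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_Z2Z, 1);          // 1-D double-complex plan, batch 1
    cufftExecZ2Z(plan, d_x, d_y, CUFFT_FORWARD);  // out-of-place forward transform

    cudaMemcpy(h_y, d_y, N * sizeof(cufftDoubleComplex), cudaMemcpyDeviceToHost);
    for (int j = 0; j < N; ++j)
        printf("y[%d] = (%.15E, %.7E)\n", j, h_y[j].x, h_y[j].y);

    cufftDestroy(plan);
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}[/codebox]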

However, if one tries N = 8, then fft(x) has 16 digits of accuracy.

Also, in a single precision test, fft(x) has 7 digits of accuracy even for N = 7.

Why can't CUFFT reach machine accuracy in double precision when the size is not a power of 2?

PS: I use CUDA 2.3 on a GTX 295.
LungSheng

It is a known bug that has been fixed internally.
A new CUFFT library will be released shortly.

Great, but I have another problem: the performance of cuFFT when the size is not a power of 2.

I tested a 3D real FFT using two methods:

method 1: a Fortran 77 package (by Roland A. Sweet and Linda L. Lindgren),

  converted to C++ with f2c and built with the Intel C++ compiler 11.1.035 and CUDA 2.3

method 2: cufftExecZ2Z or cufftExecC2C

platform: Q6600, 16 GB RAM, GTX 295, VC2005, icpc 11.1.035

Procedure of the experiment (a code sketch follows the list):

step 1: randomly generate real data

step 2: copy the data into cufftDoubleComplex / cufftComplex arrays

step 3: transfer the data from host to device

step 4: call cufftExecZ2Z / cufftExecC2C

step 5: transfer the data from device to host
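A minimal sketch of steps 3-5 for the Z2Z case (the helper name and structure are mine, not the exact benchmark code; timing and error checks omitted):

[codebox]#include <cufft.h>
#include <cuda_runtime.h>

// Sketch of steps 3-5 for one Z2Z transform of an nx*ny*nz volume.
// Assumes the host data were already packed into cufftDoubleComplex (step 2).
void forward_fft_3d(const cufftDoubleComplex *h_in, cufftDoubleComplex *h_out,
                    int nx, int ny, int nz) {
    const size_t bytes = (size_t)nx * ny * nz * sizeof(cufftDoubleComplex);

    cufftDoubleComplex *d_in, *d_out;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);

    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // step 3: h2d

    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);                // 3-D double-complex plan
    cufftExecZ2Z(plan, d_in, d_out, CUFFT_FORWARD);           // step 4: GPU FFTF
    cudaThreadSynchronize();  // launch is asynchronous; wait before stopping the clock

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // step 5: d2h

    cufftDestroy(plan);
    cudaFree(d_in);
    cudaFree(d_out);
}[/codebox]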

Description of the table entries:

h2d: host to device data transfer

d2h: device to host data transfer

CPU FFTF: forward FFT, CPU version (sequential)

GPU FFTF: forward FFT via cufftExecZ2Z or cufftExecC2C

time: rounded to ms (measured with QTime, a Qt class)

[codebox]table 1: results for double precision

+-------------+--------+----------+---------+----------+---------+
| N           | size   | CPU FFTF | GPU h2d | GPU FFTF | GPU d2h |
+-------------+--------+----------+---------+----------+---------+
| 64,64,64    |   4 MB |    47 ms |  141 ms |     0 ms |   15 ms |
| 128,128,128 |  32 MB |   359 ms |  172 ms |    16 ms |   15 ms |
| 256,256,256 | 256 MB |  2156 ms |  359 ms |    78 ms |  141 ms |
| 255,255,255 | 253 MB |  2735 ms |  359 ms |  1657 ms |  140 ms |
| 300,300,300 | 412 MB |  4250 ms |  500 ms |  2125 ms |  250 ms |
| 310,310,310 | 455 MB |  5953 ms |  515 ms |  4344 ms |  266 ms |
| 340,340,340 | 600 MB |  7265 ms |  656 ms |  2813 ms |  328 ms |
+-------------+--------+----------+---------+----------+---------+

table 2: results for single precision

+-------------+--------+----------+---------+----------+---------+
| N           | size   | CPU FFTF | GPU h2d | GPU FFTF | GPU d2h |
+-------------+--------+----------+---------+----------+---------+
| 64,64,64    |   2 MB |    47 ms |  141 ms |     0 ms |    0 ms |
| 128,128,128 |  16 MB |   297 ms |  141 ms |     0 ms |   16 ms |
| 256,256,256 | 128 MB |  3000 ms |  235 ms |    15 ms |   78 ms |
| 255,255,255 | 127 MB |  2843 ms |  235 ms |   141 ms |   78 ms |
| 300,300,300 | 206 MB |  3687 ms |  313 ms |   156 ms |  125 ms |
| 310,310,310 | 227 MB |  5094 ms |  328 ms |   328 ms |  125 ms |
| 340,340,340 | 300 MB |  6172 ms |  390 ms |   375 ms |  172 ms |
+-------------+--------+----------+---------+----------+---------+[/codebox]

d2h reaches the maximum bandwidth (1.7 GB/sec on my machine), so I focus on the

timing of the FFT kernels (CPU FFTF and GPU FFTF).

It is clear that when N is a power of 2, even in double precision,

cuFFT is 20 times faster than the CPU version.

However, if N is not a power of 2, performance drops dramatically

and becomes comparable to the CPU version.

This means that if N is (255,255,255), then a CPU FFT + OpenMP is better than cuFFT.

Is this normal?

Will that mean a point release of CUDA (like CUDA 2.3.1), or just that CUFFT will be updated and

CUDA 2.4 will come early? Since CUFFT is bundled with the toolkit, it doesn't seem easy to release

a new CUFFT by itself.

I installed the new release of toolkit 2.3, downloaded from

http://forums.nvidia.com/index.php?showtop…rt=#entry577902

and now the double precision accuracy of the FFT is correct.

I ran the same experiment (complex-to-complex, out-of-place) for N = 7

input data:  x[j] = cos( 2*pi*j/N ), j = 0, 1, 2, ..., 6

cufftExecZ2Z( plan, x, y, CUFFT_FORWARD);

y[0] = (-2.220446049250313E-016, 0.0000000E+000)

y[1] = (3.500000000000000E+000, -4.6530940E-016)

y[2] = (3.722123552849509E-016, -2.1292515E-017)

y[3] = (1.033476309669928E-016, -9.2786395E-018)

y[4] = (1.033476309669928E-016, 9.2786395E-018)

y[5] = (3.722123552849509E-016, 2.1292515E-017)

y[6] = (3.500000000000000E+000, 4.6530940E-016)

where the exact result of the forward FFT is

y[1] = 3.5

y[6] = 3.5

y[j] = 0, otherwise

Now CUFFT reaches 15 digits of accuracy in this case.

For error computations, it is better to divide by the element with the largest modulus (3.5 in this case); the errors then approach machine precision. However, due to round-off error, do not expect machine accuracy for large arrays in any floating-point computing system.
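For illustration, a small sketch of that normalization (the helper name is mine, not from cuFFT):

[codebox]#include <cmath>
#include <cufft.h>

// Max absolute error divided by the largest modulus in the exact result
// (3.5 in the N = 7 example above).
double relative_error(const cufftDoubleComplex *y,
                      const cufftDoubleComplex *yref, int n) {
    double max_err = 0.0, max_mod = 0.0;
    for (int j = 0; j < n; ++j) {
        double err = std::hypot(y[j].x - yref[j].x, y[j].y - yref[j].y);
        double mod = std::hypot(yref[j].x, yref[j].y);
        if (err > max_err) max_err = err;
        if (mod > max_mod) max_mod = mod;
    }
    return max_err / max_mod;  // ~1e-16 indicates machine accuracy in double
}[/codebox]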

Yes, you are right.

Besides, the accumulated rounding error of an inner product grows like O(N),

but for the FFT it should be O(log N), since each of the log2(N) butterfly stages contributes O(eps) error.

If I am wrong, please correct me.

Without a formal error analysis (which could be difficult or impossible), your error estimates cannot be confirmed. Furthermore, there are many technical terms attached to error analysis, such as backward errors, forward errors, relative errors, absolute errors, etc. However, error analysis of FFT computations is not a big issue, and most people are fairly happy with the accuracy they get from a standard package. In practical problems such as signal and image processing, measurement (data) errors dominate rounding errors unless the data sets are huge. For very large N, a higher precision should be used if you are worried about accuracy.

I agree with you.

I use the FFT in PDE/ODE computations (some special cases, not a general method for PDE/ODE, and not as a preconditioner);

I use the FFT to solve a linear system arising from a PDE/ODE.

The FFT is a unitary transformation, so it is more stable than LU factorization (although LU is backward stable).
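As an illustration of the idea (a minimal sketch under my own naming, not the actual solver): for a circulant matrix C with eigenvalues lambda = FFT(first column of C), the system C u = f reduces to u = IFFT( FFT(f) / lambda ); cuFFT's inverse transform is unnormalized, so the 1/n factor is folded in below.

[codebox]#include <cufft.h>
#include <cuda_runtime.h>

// Pointwise divide the spectrum by the eigenvalues, folding in the 1/n
// factor that cuFFT's unnormalized inverse transform leaves out.
__global__ void divide_by_eigenvalues(cufftDoubleComplex *fhat,
                                      const cufftDoubleComplex *lambda, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n) {
        double a = fhat[j].x, b = fhat[j].y;     // fhat[j]   = a + i*b
        double c = lambda[j].x, d = lambda[j].y; // lambda[j] = c + i*d
        double s = (c * c + d * d) * n;          // |lambda|^2, times n for scaling
        fhat[j].x = (a * c + b * d) / s;         // (a+ib)/(c+id), real part
        fhat[j].y = (b * c - a * d) / s;         // (a+ib)/(c+id), imag part
    }
}

// Solve C u = f for circulant C: u = IFFT( FFT(f) / lambda ).
void circulant_solve(cufftHandle plan, cufftDoubleComplex *d_f,
                     const cufftDoubleComplex *d_lambda,
                     cufftDoubleComplex *d_u, int n) {
    cufftExecZ2Z(plan, d_f, d_u, CUFFT_FORWARD);            // fhat = FFT(f)
    divide_by_eigenvalues<<<(n + 255) / 256, 256>>>(d_u, d_lambda, n);
    cufftExecZ2Z(plan, d_u, d_u, CUFFT_INVERSE);            // u = IFFT(fhat/lambda)
}[/codebox]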

Previously I used the CPU package (originally F77, which I translated to C code with f2c); of course, I had

evaluated the package in double precision against a high precision package downloaded from http://crd.lbl.gov/~dhbailey/mpdist/

Now I want to use the GPU as the main computation device, so I need to check that the accuracy of cuFFT is comparable with the CPU version.

The old version of CUDA 2.3 was not accurate in double precision, but now it seems O.K.

I do not believe that LU by itself is backward stable; a certain amount of stability is gained by pivoting. The old version of the FFT perhaps had a bug (and the CUDA designers probably do not want to discuss it since it has been corrected).