accuracy of CUFFT under double precision

LSChien · August 3, 2009, 8:16am

dear all:
I found that CUFFT (under double precision) is only accurate when size of FFT is power of 2

for example: consider N = 7
input data: x[j] = cos( 2pij/N ), j = 0, 1, 2, …, 6

x[0] = ( 1.000000000000000E+000, 0.000000E+000)
x[1] = ( 6.234898018587336E-001, 0.000000E+000)
x[2] = ( -2.225209339563143E-001, 0.000000E+000)
x[3] = ( -9.009688679024190E-001, 0.000000E+000)
x[4] = ( -9.009688679024192E-001, 0.000000E+000)
x[5] = ( -2.225209339563146E-001, 0.000000E+000)
x[6] = ( 6.234898018587334E-001, 0.000000E+000)

after using
cufftExecZ2Z( plan, x, y, CUFFT_FORWARD)

we have

y[0] = (-2.220446049250313E-016, 0.000000E+000)
y[1] = (3.500000000371433E+000, -4.653094E-016)
y[2] = (-2.656005562790720E-010, -2.129252E-017)
y[3] = (-1.058317940491451E-010, -9.278640E-018)
y[4] = (-1.058317940491451E-010, 9.278640E-018)
y[5] = (-2.656005562790720E-010, 2.129252E-017)
y[6] = (3.500000000371433E+000, 4.653094E-016)

but exact result of forward FFT is

y[1] = 3.5
y[6] = 3.5
y[j] = 0, otherwise

Hence CUFFT only has 10 digits accuracy in this case.

However if one tries N = 8, then fft(x) has 16 digits accuracy

also for single precision test, fft(x) has 7 digits accuracy even for N = 7

Why can not CUFFT have machine accuracy under double precision for size ~= 2^N ?

ps: I use cuda 2.3 on GTX 295
LungSheng

mfatica · August 3, 2009, 5:14pm

It is a known bug that has been fixed internally.
A new cufft library will be released shortly.

LSChien · August 4, 2009, 7:14am

great, but I have another problem, performance of cuFFT on size not power of 2

I test 3D real FFT by using

method 1: use fortran F77 package (by Roland A. Sweet and Linda L. Lindgren )

  I convert it to C++ code by f2c and use Intel C++ compiler 11.1.035, cuda2.3

method 2: use cufftExecZ2Z or cufftExecC2C

platform: Q6600, 16GB RAM, GTX295, vc2005, icpc 11.1.035

procedure of experiment

step 1: random generate real data

step 2: copy data to cufftDoubleComplex / cufftComplex

step 3: transfer data from host to device

step 4: call cufftExecZ2Z / cufftExecC2C

step 5: transfer data from device to host

description of table entry:

h2d: host to device data transfer

d2h: device to host data transfer

CPU FFTF: forward FFT of CPU version, sequential

GPU FFTF: forward FFT due to cufftExecZ2Z or cufftExecC2C

time: round to ms (use QTime (QT class) to measure time )

[codebox]table 1: result of “double precision”

------------±-------±--------±---------±--------±--------+

        |        | CPU     |   GPU    |  GPU    |  GPU    |

N       | size   | FFTF    |   h2d    |  FFTF   |  d2h    |

------------±-------±--------±---------±--------±--------+

64,64,64 | 4MB | 47 ms | 141 ms | 0 ms | 15 ms |

------------±-------±--------±---------±--------±--------+

128,128,128 | 32MB | 359 ms | 172 ms | 16 ms | 15 ms |

------------±-------±--------±---------±--------±--------+

256,256,256 | 256MB | 2156 ms | 359 ms | 78 ms | 141 ms |

------------±-------±--------±---------±--------±--------+

255,255,255 | 253MB | 2735 ms | 359 ms | 1657 ms | 140 ms |

------------±-------±--------±---------±--------±--------+

300,300,300 | 412MB | 4250 ms | 500 ms | 2125 ms | 250 ms |

------------±-------±--------±---------±--------±--------+

310,310,310 | 455MB | 5953 ms | 515 ms | 4344 ms | 266 ms |

------------±-------±--------±---------±--------±--------+

340,340,340 | 600MB | 7265 ms | 656 ms | 2813 ms | 328 ms |

------------±-------±--------±---------±--------±--------+

table 2: result of “single precision”

------------±-------±--------±---------±--------±--------+

        |        | CPU     |   GPU    |  GPU    |  GPU    |

N       | size   | FFTF    |   h2d    |  FFTF   |  d2h    |

------------±-------±--------±---------±--------±--------+

64,64,64 | 2MB | 47 ms | 141 ms | 0 ms | 0 ms |

------------±-------±--------±---------±--------±--------+

128,128,128 | 16MB | 297 ms | 141 ms | 0 ms | 16 ms |

------------±-------±--------±---------±--------±--------+

256,256,256 | 128MB | 3000 ms | 235 ms | 15 ms | 78 ms |

------------±-------±--------±---------±--------±--------+

255,255,255 | 127MB | 2843 ms | 235 ms | 141 ms | 78 ms |

------------±-------±--------±---------±--------±--------+

300,300,300 | 206MB | 3687 ms | 313 ms | 156 ms | 125 ms |

------------±-------±--------±---------±--------±--------+

310,310,310 | 227MB | 5094 ms | 328 ms | 328 ms | 125 ms |

------------±-------±--------±---------±--------±--------+

340,340,340 | 300MB | 6172 ms | 390 ms | 375 ms | 172 ms |

------------±-------±--------±---------±--------±--------+[/codebox]

d2h reach maximum bandwidth (1,7GB/sec in my machine), I foucs on

timing of FFT kernel (CPU FFTF and GPU FFTF)

It is clear that when N is power of 2, even “double precision”,

cuFFT is 20 times faster than CPU version

however if N is not power of 2, then performance is dramatically slow down

and comparable to CPU version.

This means that if N is (255,255,255), then CPU FFT + openmp is better than cuFFT

Is this normal?

Mark_Schlegel · August 12, 2009, 5:01pm

Will that mean a point release of CUDA (like CUDA 2.3.1) or just that the CUFFT will be updated and

CUDA 2.4 will be early? Since CUFFT is mixed in with the toolkit it doesn’t seem easy to just release

a new CUFFT by itself.

LSChien · September 12, 2009, 3:29am

dear all:
I found that CUFFT (under double precision) is only accurate when size of FFT is power of 2
for example: consider N = 7
input data:  x[j] = cos( 2*pi*j/N ), j = 0, 1, 2, ..., 6
x[0] = ( 1.000000000000000E+000, 0.000000E+000)

x[1] = ( 6.234898018587336E-001, 0.000000E+000)

x[2] = ( -2.225209339563143E-001, 0.000000E+000)

x[3] = ( -9.009688679024190E-001, 0.000000E+000)

x[4] = ( -9.009688679024192E-001, 0.000000E+000)

x[5] = ( -2.225209339563146E-001, 0.000000E+000)

x[6] = ( 6.234898018587334E-001, 0.000000E+000)

after using
    cufftExecZ2Z( plan, x, y, CUFFT_FORWARD)
we have

y[0] = (-2.220446049250313E-016, 0.000000E+000)

y[1] = (3.500000000371433E+000, -4.653094E-016)

y[2] = (-2.656005562790720E-010, -2.129252E-017)

y[3] = (-1.058317940491451E-010, -9.278640E-018)

y[4] = (-1.058317940491451E-010, 9.278640E-018)

y[5] = (-2.656005562790720E-010, 2.129252E-017)

y[6] = (3.500000000371433E+000, 4.653094E-016)

I install new release of toolkit 2.3, download from

http://forums.nvidia.com/index.php?showtop…rt=#entry577902

then accuracy of double precision in FFT is correct.

I take the same experiment (C2C out-of-place) for N = 7

input data:  x[j] = cos( 2*pi*j/N ), j = 0, 1, 2, ..., 6

cufftExecZ2Z( plan, x, y, CUFFT_FORWARD);

y[0] = (-2.220446049250313E-016, 0.0000000E+000)

y[1] = (3.500000000000000E+000, -4.6530940E-016)

y[2] = (3.722123552849509E-016, -2.1292515E-017)

y[3] = (1.033476309669928E-016, -9.2786395E-018)

y[4] = (1.033476309669928E-016, 9.2786395E-018)

y[5] = (3.722123552849509E-016, 2.1292515E-017)

y[6] = (3.500000000000000E+000, 4.6530940E-016)

where exact result of forward FFT is

y[1] = 3.5

y[6] = 3.5

y[j] = 0, otherwise

now CUFFT only has 15 digits accuracy in this case.

vinci · September 12, 2009, 7:18pm

I install new release of toolkit 2.3, download from

http://forums.nvidia.com/index.php?showtop…rt=#entry577902

then accuracy of double precision in FFT is correct.

I take the same experiment (C2C out-of-place) for N = 7
input data:  x[j] = cos( 2*pi*j/N ), j = 0, 1, 2, ..., 6

cufftExecZ2Z( plan, x, y, CUFFT_FORWARD);
y[0] = (-2.220446049250313E-016, 0.0000000E+000)

y[1] = (3.500000000000000E+000, -4.6530940E-016)

y[2] = (3.722123552849509E-016, -2.1292515E-017)

y[3] = (1.033476309669928E-016, -9.2786395E-018)

y[4] = (1.033476309669928E-016, 9.2786395E-018)

y[5] = (3.722123552849509E-016, 2.1292515E-017)

y[6] = (3.500000000000000E+000, 4.6530940E-016)

where exact result of forward FFT is

y[1] = 3.5

y[6] = 3.5

y[j] = 0, otherwise

now CUFFT only has 15 digits accuracy in this case.

For error computations, it is better to divide by the element with the largest modulus (in this case 3.5) and hence the errors approach machine precision. However, due to round-off errors do not expect machine accuracy for large arrays in any floating-point computing system.

LSChien · September 13, 2009, 5:37am

yes, you are right.

besides, accumulation of rounding error of inner product should be O(N)

but for FFT, it should be O(logN).

if I was wrong, then please correct me.

vinci · September 13, 2009, 2:48pm

Without a formal error analysis (which could be difficult or impossible) your error estimates cannot be conformed. Furthermore there are many technical terms attached to errors analysis such as backward errors, forward errors, relative errors, absolute errors etc. However, error analysis of FFT computations is not a big issue and most people are fairly happy about the accuracy they get from a standard package. In practical problems such as in signal and image processing, measurement (data) errors are dominant compared with rounding errors unless the data sets are huge. For very large N then a higher precision should be used if you are worried about the accuracy.

LSChien · September 14, 2009, 2:40am

I agree with you.

I use FFT on PDE/ODE computation (some special case, not general method for PDE/ODE or used as precondition),

I use FFT to solve a linear system from PDE/ODE.

FFT is a unitary transformation, then it is stable than LU-factorization (although LU is backward stable).

before I use CPU package (original:F77, but I translate it to C-code by f2c ), of course I have

evaluated the package in “double” and high precision package, downloaded from http://crd.lbl.gov/~dhbailey/mpdist/

now I want to use GPU as main computation device, so I need to ask accuracy of cuFFT (comparable with CPU version).

However old version of cuda 2.3 is not accurate on “double”, but now it seems O.K.

vinci · September 18, 2009, 1:38pm

I do not believe that LU is not backward stable but certain amount of stability is gained by pivoting. The old version of FFT perhaps had a bug (and CUDA designers probably do not want to discuss about it since it has been corrected).

Topic		Replies	Views
my speedy FFT 3x faster than CUFFT CUDA Programming and Performance	139	241043	November 16, 2011
CUFFT and FFTW Numeric Accuracy CUDA Programming and Performance	9	20347	May 28, 2009
CUFFT_INTERNAL_ERROR during creation of a 1D Plan in CUFFT GPU-Accelerated Libraries cuda , cufft	11	3764	October 19, 2022
Questions about cuFFT for 3D matrix, arrayFire GPU-Accelerated Libraries	5	1660	October 12, 2021
Trouble with cuFFT on multiple GPUs GPU-Accelerated Libraries	13	3694	August 26, 2017
CUFFT appears to give errors for vectors > 1024 CUDA Programming and Performance	6	8765	April 12, 2007
CUFFT run wrong CUDA Programming and Performance	16	2806	May 23, 2013
size limit of 1D FFT CUDA Programming and Performance	8	2526	September 24, 2011
2D CUFFT wrong result GPU-Accelerated Libraries cufft	8	3088	November 7, 2023
Multiple batches of 1D FFT using cuFFT GPU-Accelerated Libraries	10	5043	October 29, 2019

accuracy of CUFFT under double precision

Related topics