CUFFT and FFTW Numeric Accuracy

Jeru · May 25, 2009, 10:50pm

I’ve been playing around with CUDA 2.2 for the last week and, as practice, started replacing Matlab functions (interp2, interpft) with CUDA MEX files. When I first noticed that Matlab’s FFT results were different from CUFFT, I chalked it up to the single vs. double precision issue. However, the differences seemed too great so I downloaded the latest FFTW library and did some comparisons.

Using a complex input vector of 512 records:

I took the absolute difference from Matlab’s FFT result and plotted for FFTW-DP, FFTW-SP, CUDA
I did the FFT followed by the IFFT (with appropriate scaling) and compared to the original data.

Both plots are attached to this post.

I found that CUFFT results were quite a bit “worse” than the FFTW-SP results… by a mean factor of 4-5 times across the 512 records. Note that when I input a vector with 600 complex samples, the CUFFT results were about 8 times worse than FFTW-SP (due to size not being a nice factor of 2,3,5).

I guess my question is: is this type of error expected between two single-precision methods? I am looking to implement CUFFT in an application that is sensitive to signal phase and was hoping for better.

I’d appreciate any comments that could help me better understand why this difference exists. Thanks!

jack · May 26, 2009, 5:38am

Take a look at vvolkov’s improved FFT code. It was supposed to be included in the last couple of CUFFT releases, but I don’t think that it’s made it in yet. Perhaps that will have better numerical stability than the standard CUFFT release.

Also, I’m not sure how CUFFT has been written to handle non-power-of-two sizes…but if it hasn’t been done properly, it would explain why you get much less accurate results for N = 600 vs. N = 512.

Jeru · May 26, 2009, 4:31pm

I had a look at vvolkov’s FFT and saw how he was calculating him FFT error: http://forums.nvidia.com/index.php?s=&…st&p=527953

I applied the same calculations to my results and got the following:

N=512   random numbers   ULP (FFTW-SP)=0.80   ULP (CUFFT):1.50

N=4096  random numbers   ULP (FFTW-SP)=1.0	ULP (CUFFT):2.0

N=8192  random numbers   ULP (FFTW-SP)=1.0	ULP (CUFFT):2.3

N=512   my data slice	ULP (FFTW-SP)=0.93   ULP (CUFFT):2.65

N=600   my data slice	ULP (FFTW-SP)=1.04   ULP (CUFFT):5.03

Comparing these numbers to all the performance tables in the vvolkov thread, it seems that the errors I am seeing are in the expected range, even though they can be several times worse those from the FFTW single-precision calculations.

For interest’s sake, here are the embarrassing performance metrics of vvolkov’s FFT code for my system (NVS290 only has 2 multi-processsors). You’ll see that it crashed part way through. Note that the listed error is the maximum found in all the batches.

Device: Quadro NVS 290, 918 MHz clock, 256 MB memory.

Compiled with CUDA 2020.

			 --------CUFFT-------  ---This prototype---  ---two way--

   N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s erro

   8 1048576	0.4	0.5   1.8	  3.6	3.8   1.6	  3.6   2.5

  16  524288	1.0	0.8   2.1	  4.8	3.8   1.4	  4.8   1.8

  64  131072	2.6	1.4   2.4	  7.2	3.9   2.2	  7.2   2.9

  256  32768	4.4	1.8   2.2	  8.0	3.2   2.0	  7.4   3.0

  512  16384	3.0	1.1   3.0	  9.9	3.5   2.5	  9.9   3.7

  1024  8192	3.8	1.2   2.7	  8.8	2.8   2.4	  8.9   3.9

  2048  4096	2.1	0.6   3.7	  6.4	1.8   3.0	  6.3   4.5

  4096  2048cufft: ERROR: D:/Bld/rel/gpgpu/toolkit/r2.2/cufft/src/config.cu, line 315

cufft: ERROR: CUFFT_ALLOC_FAILED

While vvolkov’s prototype has slightly better error performance than CUFFT, it still doesn’t touch the accuracy of single precision FFTW. Also see: http://forums.nvidia.com/index.php?s=&…st&p=493453

profquail, thanks for steering me towards vvolkov’s FFT.

tmurray · May 26, 2009, 5:06pm

Did you install FFTW with SSE/etc. enabled?

Jeru · May 26, 2009, 6:36pm

I used the precompiled binaries from http://www.fftw.org/install/windows.html which use SSE for single-precision and SSE2 for double precision. So yes, the FFTW library that I was using had SSE enabled. Do you expect SSE to change the accuracy of FFTW?

tmurray · May 26, 2009, 6:47pm

Well, I expect FFTW without SSE to be more accurate than with SSE.

Anyway, I just talked to the main CUFFT developer, and there were two main points (but keep in mind I am not an expert at all on things like this so I may be bungling what he said–I’m a systems guy):

he thinks that comparing error by doing IFFT(FFT(data)) is not necessarily the best way to do it–he recommends doing only an FFT and comparing the results of FFTW and CUFFT.
if there are particular sizes you are worried about for whatever reason, you are more than welcome to file a bug as a registered developer.

Jeru · May 27, 2009, 2:57pm

My first test was as you suggested: comparing FFT results from FFTW single-precision and CUFFT. I included the IFFT(FFT(X)) test because I was implementing a replacement to Matlab’s interpft (FFT, zero-pad, IFFT ).

It seems to vary from FFTW for all sizes, just worse for lengths that are not powers of small primes (as stated in the CUFFT 2.1 doc). I also tried out vvolkov’s FFT and his errors were similar to the CUFFT implementation, which leads me to think that the errors are on the expected level… I just don’t understand why they are so much worse that FFTW (relatively speaking, of course) .

vvolkov · May 28, 2009, 4:50am

Sin and cos implemented in GPU hardware have lower precision than on CPU. The description of __sinf and cosf in CUDA programing guide (http://developer.download.nvidia.com/compu…2.pdf#page=131) details out the absolute error of 2^-21.19. This is a few bits fewer than full IEEE single precision (I believe it is 2^-23, which is the largest relative gap between two subsequent floating point numbers. It’s not half of it since the standard does not require transcendental functions to be exactly rounded because of table makers’s dilemma).

CUDA also has full-precision sin and cos operations. But they are implemented in software and run much slower.

You can find similar accuracy findings in Govindaraju et al. “High Performance Discrete Fourier Transforms on Graphics Processors”, 2008. They also discussed the poor accuracy in arbitrary length FFTs.

Jeru · May 28, 2009, 2:30pm

That paper is perfect! Thanks for the link. Any idea if they are making (have made) their algorithms public?

vvolkov · May 28, 2009, 4:52pm

Probably not. As somebody wrote in other thread (http://forums.nvidia.com/index.php?showtop…st&p=468989):

Topic		Replies	Views
accuracy of CUFFT under double precision CUDA Programming and Performance	9	4137	September 18, 2009
CUFFT appears to give errors for vectors > 1024 CUDA Programming and Performance	6	8764	April 12, 2007
my speedy FFT 3x faster than CUFFT CUDA Programming and Performance	139	241015	November 16, 2011
CUFFT run wrong CUDA Programming and Performance	16	2804	May 23, 2013
FFT Speed vs. x86 CUDA Programming and Performance	14	24639	July 27, 2008
Expected performance of double precision arithmetic CUDA Programming and Performance	8	3998	August 20, 2009
Apparently bug in CUFFT of CUDA 7.5 with (deprecated) NATIVE Compatibility CUDA Programming and Performance	3	1752	December 21, 2015
Are double precision functions in CUDA MATH API only the copy-paste version of single precision func CUDA Programming and Performance	4	1953	June 28, 2014
Float accuracy CUDA Programming and Performance	16	9358	July 22, 2010
cufft doubt comparing r2c and c2c 2D FFTs CUDA Programming and Performance	28	13479	October 27, 2010

CUFFT and FFTW Numeric Accuracy

Related topics