CUFFT and FFTW Numeric Accuracy

I’ve been playing around with CUDA 2.2 for the last week and, as practice, started replacing Matlab functions (interp2, interpft) with CUDA MEX files. When I first noticed that Matlab’s FFT results were different from CUFFT, I chalked it up to the single vs. double precision issue. However, the differences seemed too great so I downloaded the latest FFTW library and did some comparisons.

Using a complex input vector of 512 records:

  1. I took the absolute difference from Matlab’s FFT result and plotted for FFTW-DP, FFTW-SP, CUDA
  2. I did the FFT followed by the IFFT (with appropriate scaling) and compared to the original data.

Both plots are attached to this post.

I found that CUFFT results were quite a bit “worse” than the FFTW-SP results… by a mean factor of 4-5 times across the 512 records. Note that when I input a vector with 600 complex samples, the CUFFT results were about 8 times worse than FFTW-SP (due to size not being a nice factor of 2,3,5).

I guess my question is: is this type of error expected between two single-precision methods? I am looking to implement CUFFT in an application that is sensitive to signal phase and was hoping for better.

I’d appreciate any comments that could help me better understand why this difference exists. Thanks!

Take a look at vvolkov’s improved FFT code. It was supposed to be included in the last couple of CUFFT releases, but I don’t think that it’s made it in yet. Perhaps that will have better numerical stability than the standard CUFFT release.

Also, I’m not sure how CUFFT has been written to handle non-power-of-two sizes…but if it hasn’t been done properly, it would explain why you get much less accurate results for N = 600 vs. N = 512.

I had a look at vvolkov’s FFT and saw how he was calculating him FFT error:…st&p=527953

I applied the same calculations to my results and got the following:

N=512   random numbers   ULP (FFTW-SP)=0.80   ULP (CUFFT):1.50

N=4096  random numbers   ULP (FFTW-SP)=1.0	ULP (CUFFT):2.0

N=8192  random numbers   ULP (FFTW-SP)=1.0	ULP (CUFFT):2.3

N=512   my data slice	ULP (FFTW-SP)=0.93   ULP (CUFFT):2.65

N=600   my data slice	ULP (FFTW-SP)=1.04   ULP (CUFFT):5.03

Comparing these numbers to all the performance tables in the vvolkov thread, it seems that the errors I am seeing are in the expected range, even though they can be several times worse those from the FFTW single-precision calculations.

For interest’s sake, here are the embarrassing performance metrics of vvolkov’s FFT code for my system (NVS290 only has 2 multi-processsors). You’ll see that it crashed part way through. Note that the listed error is the maximum found in all the batches.

Device: Quadro NVS 290, 918 MHz clock, 256 MB memory.

Compiled with CUDA 2020.

			 --------CUFFT-------  ---This prototype---  ---two way--

   N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s erro

   8 1048576	0.4	0.5   1.8	  3.6	3.8   1.6	  3.6   2.5

  16  524288	1.0	0.8   2.1	  4.8	3.8   1.4	  4.8   1.8

  64  131072	2.6	1.4   2.4	  7.2	3.9   2.2	  7.2   2.9

  256  32768	4.4	1.8   2.2	  8.0	3.2   2.0	  7.4   3.0

  512  16384	3.0	1.1   3.0	  9.9	3.5   2.5	  9.9   3.7

  1024  8192	3.8	1.2   2.7	  8.8	2.8   2.4	  8.9   3.9

  2048  4096	2.1	0.6   3.7	  6.4	1.8   3.0	  6.3   4.5

  4096  2048cufft: ERROR: D:/Bld/rel/gpgpu/toolkit/r2.2/cufft/src/, line 315


While vvolkov’s prototype has slightly better error performance than CUFFT, it still doesn’t touch the accuracy of single precision FFTW. Also see:…st&p=493453

profquail, thanks for steering me towards vvolkov’s FFT.

Did you install FFTW with SSE/etc. enabled?

I used the precompiled binaries from which use SSE for single-precision and SSE2 for double precision. So yes, the FFTW library that I was using had SSE enabled. Do you expect SSE to change the accuracy of FFTW?

Well, I expect FFTW without SSE to be more accurate than with SSE.

Anyway, I just talked to the main CUFFT developer, and there were two main points (but keep in mind I am not an expert at all on things like this so I may be bungling what he said–I’m a systems guy):

  1. he thinks that comparing error by doing IFFT(FFT(data)) is not necessarily the best way to do it–he recommends doing only an FFT and comparing the results of FFTW and CUFFT.

  2. if there are particular sizes you are worried about for whatever reason, you are more than welcome to file a bug as a registered developer.

My first test was as you suggested: comparing FFT results from FFTW single-precision and CUFFT. I included the IFFT(FFT(X)) test because I was implementing a replacement to Matlab’s interpft (FFT, zero-pad, IFFT ).

It seems to vary from FFTW for all sizes, just worse for lengths that are not powers of small primes (as stated in the CUFFT 2.1 doc). I also tried out vvolkov’s FFT and his errors were similar to the CUFFT implementation, which leads me to think that the errors are on the expected level… I just don’t understand why they are so much worse that FFTW (relatively speaking, of course) .

Sin and cos implemented in GPU hardware have lower precision than on CPU. The description of __sinf and cosf in CUDA programing guide (…2.pdf#page=131) details out the absolute error of 2^-21.19. This is a few bits fewer than full IEEE single precision (I believe it is 2^-23, which is the largest relative gap between two subsequent floating point numbers. It’s not half of it since the standard does not require transcendental functions to be exactly rounded because of table makers’s dilemma).

CUDA also has full-precision sin and cos operations. But they are implemented in software and run much slower.

You can find similar accuracy findings in Govindaraju et al. “High Performance Discrete Fourier Transforms on Graphics Processors”, 2008. They also discussed the poor accuracy in arbitrary length FFTs.

That paper is perfect! Thanks for the link. Any idea if they are making (have made) their algorithms public?

Probably not. As somebody wrote in other thread (…st&p=468989):