Floating Point Accuracy

I don’t suppose anyone has looked at the accuracy of the GPU floating point unit?

I’m trying to do some benchmarking of “Novel Processor Architectures for HPC”, and one of the areas I’m keen to explore is how the deviation from IEEE-754 that’s inherent in CUDA can affect the results of a program.

If anyone has done this, or has seen a large deviation from a golden reference, could you please let me know?

Cheers,

Chris

I’ll be performing such tests on my system later this week. I’ll let you know what happens.

For our codes the deviations have been largely unnoticeable. A lot of codes out there do “bad things” with their floating point already, and so there are greater sins being committed by the algorithm than by the floating point hardware on the GPU or CPU… Unless you’ve taken some care to avoid things like summing large numbers with small ones, and various other floating point pitfalls, the GPU may be the least of your problems…
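
To make the "summing large numbers with small ones" pitfall concrete, here is a minimal C sketch. It is not taken from any code discussed in this thread; the function names `sum_naive` and `sum_kahan` are made up for illustration, and it assumes the compiler evaluates `float` arithmetic in single precision without fast-math style reassociation.

```c
#include <stddef.h>

/* Naive single-precision accumulation: once the running sum is much
   larger than the next addend, the addend's low-order bits are lost. */
float sum_naive(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        s += x[i];
    }
    return s;
}

/* Kahan (compensated) summation: c carries the rounding error of the
   previous addition and feeds it back into the next addend. */
float sum_kahan(const float *x, size_t n) {
    float s = 0.0f;
    float c = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float y = x[i] - c;   /* addend corrected by the stored error */
        float t = s + y;      /* low-order bits of y may be lost here */
        c = (t - s) - y;      /* recover what was lost (negated)      */
        s = t;
    }
    return s;
}
```

With an array holding 16777216.0f (2^24) followed by one hundred 1.0f values, the naive loop never moves off 2^24 because each 1.0f falls below the spacing of representable floats there, while the compensated loop reaches 16777316.0f. The same effect shows up on any single-precision hardware, CPU or GPU.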

John Stone

That’s good to know.

I’m benchmarking CUDA against FPGAs and a regular CPU at the moment, and if people had come back to me saying there were big problems, I was going to have to write a test case to see how serious they were. Now I can just gloss over it and say that the lack of full IEEE compliance is almost insignificant.

Cheers,

Chris

How significant it is depends on the exact code sequence and input data. In practice, it’s wise to design your algorithm to explicitly avoid doing things that put the floating point hardware on the ropes. I don’t know what your algorithm is; I can only comment on the ones we’ve been working on.

If you do observe non-trivial differences between CPU, GPU, and FPGA, then you should go back and look at whether your algorithm is doing things that are ill-advised in terms of its use of floating point. Only after you’re sure that you’ve done everything you can to preserve numerical precision in your algorithm would it be fair to point your finger at the hardware. :-)

John Stone

Just an FYI. We compared the error of x - ifft2(fft2(x)) for CUDA (single precision) and MATLAB (double precision). The error did not seem to grow much with matrix size.

IFFT2(FFT2(X[256 by 256])) with a cuda error of 2.669e-007 and a matlab error of 2.3853e-016
IFFT2(FFT2(X[512 by 512])) with a cuda error of 3.9338e-007 and a matlab error of 2.5231e-016
IFFT2(FFT2(X[1024 by 1024])) with a cuda error of 3.5079e-007 and a matlab error of 2.6822e-016
IFFT2(FFT2(X[2048 by 2048])) with a cuda error of 5.149e-007 and a matlab error of 2.8168e-016
IFFT2(FFT2(X[3072 by 3072])) with a cuda error of 7.3526e-007 and a matlab error of 2.9145e-016

Our error metric was the mean of the absolute difference between x and ifft2(fft2(x)) (not squared). The code for the calculations is attached, along with our computed histogram of error values for CUDA and MATLAB. The CUDA errors looked fairly Gaussian (probably more so if we had used MSE). Anyway, just an FYI as to our results. We have noticed some unusual error growth with some of our conjugate gradient algorithms, but haven’t concluded whether this is due to CUDA or to ill-conditioned data.
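
For readers without the attachment, the metric described above can be sketched in a few lines of C. This is only an illustration of "mean of the absolute difference", not the attached code; the name `mean_abs_error` is hypothetical, and accumulating in double is an assumption made so that the metric itself does not add single-precision rounding noise.

```c
#include <math.h>
#include <stddef.h>

/* Mean absolute error between an original signal x and its
   round-tripped copy y (for example ifft2(fft2(x))), both stored
   as flat arrays of n single-precision values. */
double mean_abs_error(const float *x, const float *y, size_t n) {
    double acc = 0.0;   /* accumulate in double to keep the metric clean */
    for (size_t i = 0; i < n; ++i) {
        acc += fabs((double)x[i] - (double)y[i]);
    }
    return n ? acc / (double)n : 0.0;
}
```

For a 2-D matrix stored contiguously, n is simply rows times columns.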

Cheers,
Paul
speed_fft_v2.txt (1.51 KB)
error_histogram.PNG

As promised, I checked the numerical accuracy of my application. A note first: my application is chaotic. A single value difference in the 18th decimal place in part of the calculation can send the simulation down a completely different path. I don’t care, because there are billions of billions of statistically equivalent paths my simulation can take. The downside is that this makes quantitative accuracy comparisons difficult. Using a double precision CPU calculation on a single processor as the baseline, I measured the number of iterations it takes for a different simulation with the same starting point to deviate significantly from it.

  1. Double precision CPU calculation on 8 processors
    5800 steps to deviate
  2. Single precision CPU calculation on 1 processor
    3800 steps to deviate
  3. Single precision calculation on 1 GPU
    3800 steps to deviate

So, the “inferior” floating point on the GPU is just as good as a CPU single precision calculation for me. I imagine you’d have to come up with a pretty contrived example to show that the GPU’s FP is significantly worse.
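
The "steps to deviate" idea can be tried on a toy problem. The sketch below is not the simulation described above: it uses the logistic map with r = 3.9 as a hypothetical stand-in for a chaotic system, and the function name `steps_to_deviate` and the tolerance are assumptions chosen for illustration.

```c
#include <math.h>

/* Iterate x <- r * x * (1 - x) in double and in float from the same
   starting point and return the first step at which the two
   trajectories differ by more than tol. Returns -1 if they stay
   within tol for max_steps iterations. */
int steps_to_deviate(double x0, double tol, int max_steps) {
    const double r = 3.9;          /* chaotic regime of the logistic map */
    double xd = x0;                /* double-precision baseline          */
    float xf = (float)x0;          /* single-precision run               */
    for (int step = 1; step <= max_steps; ++step) {
        xd = r * xd * (1.0 - xd);
        xf = (float)r * xf * (1.0f - xf);
        if (fabs(xd - (double)xf) > tol) {
            return step;
        }
    }
    return -1;
}
```

Calling `steps_to_deviate(0.3, 0.1, 1000)` gives a step count that plays the same role as the figures in the list above. Swapping the float update for a result copied back from a GPU kernel would allow the same comparison across devices.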

Thanks guys.

That’s the kind of information I was looking for. I did an example myself, performing some simple calculations (log, sin, cos, etc.), and there didn’t seem to be any difference between the single precision GPU and the single precision CPU. Your more in-depth examples further encourage me that there is little problem with the device not being fully IEEE compliant. I can’t really think of a better test for floating point deviation than what your code does, Mr Anderson, and I think I might use MATLAB and my sample code to create a graph very similar to yours, Paul.

Cheers,

Chris

Attached is a comparison of IEEE 754 compliance on the G80 and other architectures.

Massimiliano
G80_IEEE.pdf (200 KB)

Hi there,

this is a bit late, but it may be useful to whoever reads this thread in the future.

Here are two papers I found a while back that try to understand the limitations in IEEE 754 compliance of GPUs, specifically for scientific applications.

In this case, the author figured out how to achieve double precision with some extra code, and still achieve a 4-5x speed increase relative to CPUs.

Cheers

Chahé
ijpeds06.pdf (1.02 MB)
SC06.pdf (594 KB)

Can somebody see why this simple operation can produce a 10^-3 error compared to the CPU?

float deltaE = mbfsIn.kb[tid] * delta_r01
* (2.0f - delta_r01 * (6.0f - 9.333334f * delta_r01));

cheers,
Thanasio

What GPU? What’s the value of delta_r01? I will assume you are running on a Fermi or Kepler class GPU. The compiler very likely generates code using two FFMAs (single-precision fused multiply-adds) for the latter part of the computation, that is,

[expr] * fmaf (fmaf (9.333334f, -delta_r01, 6.0f), -delta_r01 , 2.0f)

If either of the two products is close to the corresponding constant, but of opposite sign (meaning delta_r01 is positive) there will be subtractive cancellation, followed by renormalization. On the CPU, where the product is computed to single precision, the bits shifted in on the right will be zero, but on the GPU where all product bits are retained inside the FMA, lower order bits of the product will be shifted in on the right. The closer the product to the constant, the bigger the difference will be.

If you look at the bit pattern of the intermediate result, and see trailing zeros in the CPU result, but non-zero trailing bits in the GPU result, that would be a good indication that my working hypothesis is correct, and in that case the GPU delivers the more accurate result thanks to FMA.

If you turn off FMA generation with -fmad=false, do the results match between CPU and GPU?

I would suggest reading the following whitepaper and also the references it cites:

https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf