Float accuracy

Cuda_Libre · July 14, 2010, 4:00pm

Hi,

I wrote a CUDA simulation that works quite well, but I just found a problem:
if I run it twice and perform a “diff” on the final output, there are small differences.
They only appear after a lot of iterations, and the difference is not significant (~1/100000 relative error), but I was wondering:

Is it a bug in my program?
Does CUDA guarantee that the same data gives exactly the same output?

I’m just using:
max, min, “<” and “>”, sqrtf, divide, add, subtract, multiply, and I am not using --use_fast_math

Thanks !

YDD · July 14, 2010, 4:38pm

If you have no race conditions, and no hardware faults then I think you should get the same answers.

_Big_Mac · July 14, 2010, 5:55pm

Some operations may get sent to SFUs while others get computed in SPs.

If you’re doing a parallel reduction, that’s another possible cause for small differences (since A+B != B+A and the order of reduction is undefined).

YDD · July 14, 2010, 6:28pm

But shouldn’t the choice of which get sent be deterministic? Same for the reduction. The same input array should reduce to the same sum, since the sequence of parallel operations should remain the same (even if the relative timing does not), and commutativity should hold in IEEE arithmetic. Changing the order of the input array will naturally change the final sum, since associativity does not hold.
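A minimal sketch of the distinction drawn above, in Python rather than CUDA so it runs anywhere. The helpers `f32` and `add32` are hypothetical names that emulate single-precision addition by rounding through the `struct` module; this is not how the GPU does it, only an illustration of the IEEE rounding behaviour. Swapping operands leaves the result unchanged, while regrouping them does not.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def add32(x, y):
    """Emulate single-precision addition by rounding the sum to float32."""
    return f32(f32(x) + f32(y))

a, b, c = 1e8, -1e8, 1.0

# Commutativity holds: swapping the operands does not change the rounded sum.
commutes = add32(a, c) == add32(c, a)

# Associativity fails: the grouping decides which intermediate gets rounded.
left = add32(add32(a, b), c)    # (a + b) + c: the large terms cancel first
right = add32(a, add32(b, c))   # a + (b + c): the 1.0 is absorbed into -1e8

print(commutes, left, right)    # True 1.0 0.0
```

So a reduction that always pairs the same elements in the same grouping should be repeatable, whereas anything that changes the grouping can change the last bits (or, in a contrived case like this one, the whole answer).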

Cuda_Libre · July 14, 2010, 6:35pm

Interesting !

What’s the general behavior ?

I’m using cublasIsamax, but I guess it’s only computing lots of “max(a, b)”… and it would be quite annoying if max(a, b) != max(b, a)!

laughingrice · July 14, 2010, 8:54pm

I don’t know the implementation of sqrtf on the GPU, but some implementations use a random initializer. It could be that this changes the least significant bit.

_Big_Mac · July 15, 2010, 6:47am

MULs can be calculated either on SPs or on SFUs. It’s up to the scheduler to figure out where to route a MUL instruction and it’s dynamic. There might be differences in implementations.

SFUs also have native RCP and RSQ, so they might calculate your sqrts. SPs can also do this in software. I’m not sure whether the scheduler is free to do sqrts in both places without explicitly doing --fast-math or something. They might also take on divisions (with RCP).

So it’s likely that some instructions are handed to SFUs from time to time. This would be nondeterministic.

Nighthawk13 · July 15, 2010, 5:04pm

Right about the MUL (GT200 only). But shouldn’t both MUL units behave the same?

How division/sqrt is implemented is decided at compile time. The hardware scheduler only sees the resulting instruction sequence (rcp/rsqrt/mad).

On Fermi, atomicAdd(float) might cause a nondeterministic order of additions.

(And just for completeness: if a write occurs to the same memory location from different threads at the same time, it is undefined which of these values will actually be stored. So yes, it could be a bug in your program).
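To illustrate why the order of floating-point atomic additions matters, here is a small Python sketch. It does not use CUDA or real atomics: it simply treats each permutation of a list as one possible arrival order of contributions at a single accumulator. The names `f32` and `accumulate32` and the chosen values are assumptions made for the example.

```python
import struct
from itertools import permutations

def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate32(values):
    """Add values one at a time into a single-precision accumulator."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total

# Four contributions that different threads might add to the same location.
contributions = [1e8, -1e8, 1.0, 1.0]

# Each permutation stands in for one possible arrival order.
results = sorted({accumulate32(order) for order in permutations(contributions)})

print(results)  # [0.0, 1.0, 2.0]
```

The values are deliberately extreme so the effect is visible; with well-scaled data the difference would normally show up only in the last few bits, which is consistent with a relative error around 1e-5 after many iterations.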

Cuda_Libre · July 15, 2010, 5:29pm

Thank you for all your answers, it seems I must check the source code again.

I don’t think I have concurrent writes to the same memory location (the code is not supposed to do that kind of thing), but… I’ll pay particular attention to that, thanks.

Or perhaps a missing __syncthreads() in a very vicious place…!

YDD · July 15, 2010, 5:36pm

Those are possibilities I hadn’t considered (I’d argue that atomics come under the general heading of ‘race conditions’ but they absolutely merit separate mention). It would be nice to see a specific statement from NVIDIA.

_Big_Mac · July 16, 2010, 6:19pm

I don’t really know. They’re different pieces of hardware and it’s possible they behave differently, but this could only be answered by someone who’s either an NVIDIA engineer or has done extensive microbenchmarks.

I usually just assume floats are wild, untamed beasts and whatever happens to the least significant digits is up to the moon cycle. And that == for floats is pretty much invalid C++ :)

Ooh, those can be evil bugs. I had one in my last code (a neural network); it took me a couple of hours of staring at the source, debuggers being useless.

Perhaps a good debugging strategy is to put a __syncthreads() after every shared memory access (read or write) and, if that works, keep commenting them out until something breaks.

ONeill · July 19, 2010, 10:10am

What I have seen when comparing CUFFT results to Matlab results is a relative error of around 1e-7 for single precision and 1e-14 for double precision. Sadly, even a GTX 480 didn’t significantly improve on the old GT200, despite Fermi cards being more IEEE-compliant. I heard one possible reason is the (hardware) sin/cos implementations on the GPU. Running a couple of FFTs in my algorithm has shown results that differ from the Matlab results by more than 1% in single precision, so I see CUFFT’s accuracy as a real problem. Maybe it would help if they provided functions to which you could pass sin/cos-dependent values precomputed on the CPU (e.g. at plan creation).

YDD · July 19, 2010, 1:12pm

Errr… 1e-7 single precision and 1e-14 double precision are, to all intents and purposes, identical.

Now, a 1% error after a single CUFFT call would be significant. What sort of data set is this on?

ONeill · July 22, 2010, 7:48am

What do you mean by 1e-7 and 1e-14 being identical? I mean that with SP I often saw errors at around the 7th decimal place, and with DP at the 14th decimal place, in comparison to CPU results (which come from a different FFT implementation, so this is only a rough comparison, of course).
Of course I don’t get this 1% error from a single call. It appears after a couple of FFTs of smaller sizes, each 1D C2C, plus some other math in SP.
What I mean is that errors of this magnitude on GPUs don’t look that strange to me.

YDD · July 22, 2010, 1:28pm

What I mean is that single precision answers which differ at the seventh decimal place should be considered identical. Similarly for double precision answers differing at the 14th decimal place. If this seems a surprise, do some reading.

Your error of 1% is of greater interest/concern.
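A rough sketch of the point about the seventh decimal place, again in Python for portability. Single precision has a 24-bit significand, so adjacent values near 1.0 are 2^-23 (about 1.2e-7) apart. Two results that differ by one such step are as close as the format allows without being bit-identical. The helper names and the tolerance of 1e-6 are assumptions chosen for illustration, not a recommended threshold.

```python
import math
import struct

def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def rel_diff(p, q):
    """Relative difference between two nonzero values."""
    return abs(p - q) / max(abs(p), abs(q))

eps32 = 2.0 ** -23          # gap between 1.0 and the next float32
x = f32(1.0)
y = f32(1.0 + eps32)        # the next representable float32 above 1.0

print(x == y)                               # False: exact equality is too strict
print(rel_diff(x, y))                       # roughly 1.19e-07
print(math.isclose(x, y, rel_tol=1e-6))     # True: equal within a tolerance
print(math.isclose(1.0, 1.01, rel_tol=1e-6))  # False: a 1% error is a real difference
```

Comparing with a relative tolerance rather than `==` separates rounding noise of this size from a discrepancy such as the 1% error, which is several orders of magnitude larger.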

Cuda_Libre · July 22, 2010, 11:18pm

OK, bug found after hours of investigation!
I have a loop that iterates along Z with 18 __syncthreads() calls, but I missed the one just before the end of the iteration… shame, shame.
Thank you, this topic is very instructive (especially the link “What Every Computer Scientist Should Know About Floating-Point Arithmetic”, highly recommended! Thanks a lot).
