Float accuracy


I wrote a CUDA simulation that works fine, but I just found a problem:
if I run it twice and diff the final outputs, there are small differences.
They show up after many iterations, and the difference is not significant (~1/100000 relative error), but I was wondering:

  • is it a bug in my program?
  • does CUDA guarantee that the same data gives exactly the same output?

I’m only using:
max, min, “<” and “>”, sqrtf, divide, add, subtract and multiply, and I’m not using --use_fast_math.

Thanks !

If you have no race conditions, and no hardware faults then I think you should get the same answers.

Some operations may get sent to SFUs while others get computed in SPs.

If you’re doing a parallel reduction, that’s another possible cause for small differences (since A+B != B+A and the order of reduction is undefined).

But shouldn’t the choice of which get sent be deterministic? Same for the reduction. The same input array should reduce to the same sum, since the sequence of parallel operations should remain the same (even if the relative timing does not), and commutativity should hold in IEEE arithmetic. Changing the order of the input array will naturally change the final sum, since associativity does not hold.
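To make the commutativity vs. associativity distinction concrete, here is a minimal host-side sketch. It assumes IEEE 754 single precision with no fast-math style reordering by the compiler; the helper names are made up for illustration and are not part of any CUDA API.

```cpp
// Hypothetical helpers, for illustration only.

// Group the additions from the left: (a + b) + c.
float sum_left(float a, float b, float c) { return (a + b) + c; }

// Group the additions from the right: a + (b + c).
float sum_right(float a, float b, float c) { return a + (b + c); }

// Swapping the two operands of a single addition does not change the result.
bool add_commutes(float a, float b) { return a + b == b + a; }

// With a = 1.0f, b = 1e8f, c = -1e8f:
//   sum_left  loses the 1.0f, because 1.0f + 1e8f rounds back to 1e8f
//   sum_right keeps it, because 1e8f + (-1e8f) is exactly 0.0f
```

So a reduction that always groups its operands the same way should give the same sum on every run, while anything that changes the grouping between runs can change the last bits.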

Interesting!

What’s the general behavior?

I’m using cublasIsamax, but I guess it’s only computing lots of “max(a, b)”… and it would be quite annoying if max(a, b) != max(b, a)!

I don’t know how sqrtf is implemented on the GPU, but some implementations use a random initializer. That could change the least significant bit.

MULs can be calculated either on SPs or on SFUs. It’s up to the scheduler to figure out where to route a MUL instruction and it’s dynamic. There might be differences in implementations.

SFUs also have native RCP and RSQ, so they might calculate your sqrts. SPs can also do this in software. I’m not sure whether the scheduler is free to do sqrts in both places without explicitly passing --use_fast_math or something similar. They might also take on divisions (with RCP).

So it’s likely that some instructions are handed to SFUs from time to time. This would be nondeterministic.

Right about the MUL (GT200 only). But shouldn’t both MUL units behave the same?

How division/sqrt is implemented is decided at compile time. The hardware scheduler only sees the resulting instruction sequence (rcp/rsqrt/mad).

On Fermi, atomicAdd(float) might cause a nondeterministic order of additions.

(And just for completeness: if a write occurs to the same memory location from different threads at the same time, it is undefined which of these values will actually be stored. So yes, it could be a bug in your program).

Thank you for all your answers, it seems I must check the source code again.

I don’t think I have concurrent writes to the same memory location (the code is not supposed to do that kind of thing), but I’ll pay particular attention to that, thanks.

Or perhaps a missing __syncthreads() in a very vicious place!

Those are possibilities I hadn’t considered (I’d argue that atomics come under the general heading of ‘race conditions’, but they absolutely merit separate mention). It would be nice to see a specific statement from NVIDIA.

I don’t really know. They’re different pieces of hardware and it’s possible they behave differently, but this could only be answered by someone who’s either an NVIDIA engineer or has done extensive microtests.

I usually just assume floats are wild, untamed beasts and whatever happens to the least significant digits is up to the moon cycle. And that == for floats is pretty much invalid C++ :)

Ooh, those can be evil bugs. I had one in my last code (a neural network); it took me a couple of hours of staring at the source, with debuggers being useless.

Perhaps a good debugging strategy is to put a __syncthreads() after every shared memory access (read or write) and, if that works, keep commenting them out until something breaks.

What I have seen when comparing CUFFT results to Matlab results is a relative error of around 1e-7 for single precision and 1e-14 for double precision. Sadly, even a GTX 480 didn’t significantly improve on the old GT200, despite Fermi cards being more IEEE-compliant. I heard one possible reason is the hardware sin/cos implementations on the GPU. Running a couple of FFTs in my algorithm has shown that the results differ from the Matlab results by more than 1% in single precision, so I see CUFFT’s accuracy as a real problem. Maybe it would help if they provided functions to which you could pass sin/cos-dependent values pre-calculated on the CPU (e.g. at plan creation).

Errr… 1e-7 single precision and 1e-14 double precision are, to all intents and purposes, identical.

Now, a 1% error after a single CUFFT call would be significant. What sort of data set is this on?

What do you mean by 1e-7 and 1e-14 being identical? I mean that with SP I often saw errors at around the 7th decimal place, and with DP at the 14th decimal place, in comparison to CPU results (which come from a different FFT implementation, so this is only a rough comparison, of course).
I don’t get this 1% error from a single call, of course. It appears after a couple of FFTs of smaller sizes, each 1D C2C, plus some other math in SP.
What I mean is, errors of this magnitude on GPUs don’t look that strange to me.

What I mean is that single precision answers which differ at the seventh decimal place should be considered identical. Similarly for double precision answers differing at the 14th decimal place. If this seems a surprise, do some reading.

Your error of 1% is of greater interest/concern.
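For comparing two runs (or GPU vs. CPU output), a tolerance-based check is usually more useful than a byte-for-byte diff. The following is only a sketch: nearly_equal is a hypothetical helper, and the default tolerances (1e-6 relative, 1e-12 absolute) are assumptions picked to sit just above single-precision rounding noise, not values from CUDA or CUFFT.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: true when a and b agree to within a relative
// tolerance, with an absolute floor so values near zero can still match.
bool nearly_equal(float a, float b,
                  float rel_tol = 1e-6f,    // assumed default, roughly 6 digits
                  float abs_tol = 1e-12f) { // assumed floor for tiny values
    if (a == b) return true;                // exact matches, including +0 and -0
    float diff  = std::fabs(a - b);
    float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(rel_tol * scale, abs_tol);
}
```

With these defaults, 1.0f and 1.0000001f compare as equal, while 1.0f and 1.01f (a 1% difference) do not.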

OK, bug found after hours of investigation!
I have a loop that iterates along Z, with 18 __syncthreads() calls, but I missed the one just before the end of the iteration… shame on me.
Thank you, this topic is very instructive (especially the link “What Every Computer Scientist Should Know About Floating-Point Arithmetic”, highly recommended! Thanks a lot).