I wrote a CUDA simulation that works quite well, but I just found a problem:
if I run it twice and diff the final output, there are small differences.
They only show up after a lot of iterations, and the difference is not significant (~1/100000 relative error), but I was wondering:
Is it a bug in my program?
Does CUDA guarantee that the same input data gives exactly the same output?
I’m only using:
max, min, “<” and “>”, sqrtf, divide, add, subtract, multiply, and I am not using --use_fast_math.
But shouldn’t the choice of which get sent be deterministic? Same for the reduction. The same input array should reduce to the same sum, since the sequence of parallel operations should remain the same (even if the relative timing does not), and commutativity should hold in IEEE arithmetic. Changing the order of the input array will naturally change the final sum, since associativity does not hold.
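To make the associativity point concrete, here is a minimal host-side sketch in plain C++ (not CUDA). The function names `sum_sequential` and `sum_pairwise` are made up for illustration, and the sketch assumes IEEE single precision evaluation without fast-math style compiler flags. It sums the same array once left to right and once as a pairwise tree, which is roughly the grouping a parallel reduction uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum left to right, the way a plain sequential loop would.
float sum_sequential(const std::vector<float>& v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// Sum the half-open range [lo, hi) by pairwise (tree) reduction.
// Same values, different grouping of the additions.
float sum_pairwise(const std::vector<float>& v, std::size_t lo, std::size_t hi) {
    assert(lo < hi && hi <= v.size());
    if (hi - lo == 1) return v[lo];
    std::size_t mid = lo + (hi - lo) / 2;
    return sum_pairwise(v, lo, mid) + sum_pairwise(v, mid, hi);
}
```

With the input `{1e8f, 1.0f, -1e8f, 1.0f}` the two functions disagree, because 1e8f + 1.0f rounds back to 1e8f in single precision. Each function on its own still returns the same value on every call, which is the point made above: a fixed grouping gives a reproducible sum, and only a changing grouping or order changes the result.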
I don’t know how sqrtf is implemented on the GPU, but some implementations use a random initializer. That could change the least significant bit.
MULs can be calculated either on SPs or on SFUs. It’s up to the scheduler to figure out where to route a MUL instruction and it’s dynamic. There might be differences in implementations.
SFUs also have native RCP and RSQ, so they might calculate your sqrts. SPs can also do this in software. I’m not sure whether the scheduler is free to do sqrts in both places unless you explicitly pass --use_fast_math or something similar. SFUs might also take on divisions (with RCP).
So it’s likely that some instructions are handed to the SFUs from time to time. That would be nondeterministic.
Right about the MUL (GT200 only). But shouldn’t both MUL units behave the same?
How division and sqrt are implemented is decided at compile time. The hardware scheduler only sees the resulting instruction sequence (rcp/rsqrt/mad).
On Fermi, atomicAdd(float) can make the order of the additions nondeterministic.
(And just for completeness: if a write occurs to the same memory location from different threads at the same time, it is undefined which of these values will actually be stored. So yes, it could be a bug in your program).
Thank you for all your answers, it seems I must check the source code again.
I don’t think I have concurrent writes to the same memory location (the code is not supposed to do that kind of thing), but I’ll pay particular attention to that, thanks.
Or perhaps a missing __syncthreads() in a very vicious place…!
Those are possibilities I hadn’t considered (I’d argue that atomics come under the general heading of ‘race conditions’ but they absolutely merit separate mention). It would be nice to see a specific statement from NVIDIA.
I don’t really know. They’re different pieces of hardware and it’s possible they behave differently, but this could only be answered by someone who’s either an NVIDIA engineer or has run extensive microtests.
I usually just assume floats are wild, untamed beasts and whatever happens to the least significant digits is up to the moon cycle. And that == for floats is pretty much invalid C++ :)
Ooh, those can be evil bugs. I had one in my last code (a neural network); it took me a couple of hours of staring at the source, with debuggers being useless.
Perhaps a good debugging strategy is to put a __syncthreads() after every shared memory access (read or write) and, if that works, keep commenting them out until something breaks.
What I have seen when comparing CUFFT results to Matlab results is a relative error of somewhere near 10e-7 for single precision and 10e-14 for double precision. Sadly, even with a GTX 480 this didn’t improve significantly over the old GT200, despite Fermi cards being more conformant to IEEE. I heard one possible reason could be the sin/cos (hardware) implementations on the GPU. Doing a couple of FFTs in my algorithm has shown that results differ from Matlab results by more than 1% in single precision, so I see CUFFT’s accuracy as a real problem. Maybe it would help if they provided functions to which you could pass sin/cos-dependent values that you have precalculated on the CPU (e.g. for plan creation).
What do you mean by -7 and -14 being identical? I mean that with SP I often saw errors at around the 7th decimal place, and with DP at the 14th decimal place, in comparison to CPU results (which come from a different FFT implementation, so this is only a rough comparison of course).
I don’t get this 1% error from a single call, of course. It’s after a couple of FFTs of smaller sizes, each 1D C2C, plus some other math in SP.
What I mean is, errors of this magnitude with GPUs don’t look that strange to me.
What I mean is that single precision answers which differ at the seventh decimal place should be considered identical. Similarly for double precision answers differing at the 14th decimal place. If this seems a surprise, do some reading.
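As a rough sketch of what "should be considered identical" can mean in code, here is a relative-tolerance comparison in plain C++. The name `nearly_equal` and the default tolerances (1e-6 relative, 1e-12 absolute) are assumptions picked for illustration, not values from any CUDA or CUFFT documentation; choose them to suit your own data.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// True when a and b agree to within rel_tol, measured against the larger
// magnitude. abs_tol covers values near zero, where a purely relative
// test would reject everything. Both defaults are illustrative assumptions.
bool nearly_equal(float a, float b, float rel_tol = 1e-6f, float abs_tol = 1e-12f) {
    assert(rel_tol >= 0.0f && abs_tol >= 0.0f);
    // NaN and infinities: only exact equality counts (NaN never matches).
    if (!std::isfinite(a) || !std::isfinite(b)) return a == b;
    float diff = std::fabs(a - b);
    float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(rel_tol * scale, abs_tol);
}
```

Comparing two runs element by element with something like this, instead of a textual diff or `==`, treats a difference in the seventh significant digit as a match while still flagging the kind of 1% drift mentioned earlier in the thread.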
OK, bug found after hours of investigation!
I have a loop that iterates along Z, with 18 __syncthreads() calls, but I missed the one just before the end of the iteration… shame, shame.
Thank you, this topic has been very instructive (especially the link to “What Every Computer Scientist Should Know About Floating-Point Arithmetic”, which I highly recommend! Thanks a lot).