What is the floating-point precision on CUDA? What standard does it follow? I have a bit of code that accumulates values, and I am getting quite different results between the CPU and the CUDA device-emulation code; the error then really blows up.
Does anyone know of some standard practices to mitigate this?
Can you be more specific? GPU arithmetic is supposed to be (largely) IEEE 754 compliant. And if you’re running in emulation mode, then you’re using the CPU’s floating-point units anyway, so the GPU’s FP hardware is irrelevant. Based on the vague description,
I’m guessing that you’re summing all the elements in an array?
To avoid round-off error in long sums, there are two options: use double precision, or use Kahan summation. Double precision requires a device of compute capability 1.3 or greater, and you have to pass the -arch sm_13 flag to the nvcc compiler or doubles will be silently demoted to floats.
Kahan summation is a standard trick (see Wikipedia) that uses two floats: one for the running sum and one for the accumulated error of the sum. On most CUDA devices (the C2050 excepted), it should be about twice as fast as accumulating in double, although the code is a little harder to read.
Thanks for the tip on Kahan summation. However, wouldn’t the multiplication in my kernel also generate significant error? I think the dominant error in my case comes from multiplying two floats when one or both of them are quite small.
Other options include changing the order of summation (e.g. sorting and summing from smallest to largest) or using a tree reduction, which also reduces error by tending to keep the summed terms at approximately the same order of magnitude.
Multiplication in floating point grows relative error at a low, predictable rate (unless you are overflowing or underflowing). It is the addition of two values of very different magnitude that grows error quickly, and that is exactly what happens in long sums.