Different output from emulation and device precision issues on GPU vs CPU

My code executed in Emulator mode gives different floating-point results from the same kernels run on the device in Debug mode.

The outputs agree up to about the 7th decimal digit and diverge after that. Also, when a number is very small, the value in Debug mode looks completely wrong. Here are some examples:

device:

[codebox]

0.33982921     0.00000000     0.00000000     0.00000000    -0.51045752    -0.55610180     0.00000000     0.00000000     0.00000000

0.00000000     0.40824831     0.00000000     0.00000000     0.00000000     0.00000000     0.00000000     0.00000000     0.00000000

0.00000000     0.00000000     0.40824828     0.00000000     0.00000000     0.00000000     0.00000000     0.00000000     0.00000000

0.00000000     0.00000000     0.00000000     0.50000006     0.00000000     0.00000000     0.00000000     0.00000000     0.00000000

0.00000013     0.00000000     0.00000000     0.00000000     0.73669380     0.06317040     0.00000000     0.00000000     0.00000000

0.00000012     0.00000000     0.00000000     0.00000000    -0.00000019     0.73939711     0.00000000     0.00000000     0.00000000[/codebox]

emulator:

[codebox]

0.33982915     0.00000000     0.00000000     0.00000000     -0.51045758    -0.55610168     0.00000000    0.00000000      0.00000000

0.00000000     0.40824831     0.00000000     0.00000000      0.00000000     0.00000000     0.00000000     0.00000000      0.00000000

0.00000000     0.00000000     0.40824828     0.00000000      0.00000000     0.00000000     0.00000000     0.00000000      0.00000000

0.00000000     0.00000000     0.00000000     0.50000006      0.00000000     0.00000000     0.00000000     0.00000000      0.00000000

8.9406967e-008  0.00000000    0.00000000    0.00000000      0.73669374     0.063170217   0.00000000     0.00000000     0.00000000 

2.3841858e-007  0.00000000    0.00000000    0.00000000     -1.4901161e-008  0.73939717   0.00000000     0.00000000     0.00000000[/codebox]

I’m using a GTX 280 on Windows XP Pro x64 Edition, and the program has been compiled with the 'nvcc -arch=sm_13' option.

I read a posting on this forum from someone who had a similar problem (http://forums.nvidia.com/index.php?showtop…mp;hl=precision), and added -Xopencc -O0 to the compile options as he did, but nothing changed.

I also found an article about this problem (http://www.bv2.co.uk/?p=910#more-910) which says the difference arises because all float operations in CUDA occur at 32 bits, while the CPU (i.e. the emulator) uses 80-bit precision for floating-point operations. That would explain the small differences, for example 0.33982915 becoming 0.33982921. However, I don’t understand why some very small numbers such as 8.9406967e-008 turn into weird values like 0.00000013. Is this related to overflow or underflow?

The article says the CPU/FPU’s internal precision/rounding settings should be changed to 32 bits, but I don’t know how to do that. Even worse, it says this only mitigates the difference and still does not give identical values.

Any ideas or advice will be very appreciated.

This is nothing but the expected difference in rounding between 80-bit x87 and 32-bit single precision. Why is it a problem that they don’t necessarily match?

Short of going over the topic yet again, I’ll just post a link…

http://docs.sun.com/source/806-3568/ncg_goldberg.html

Put simply, floating point operations between devices (be it CPU/GPU/DSP/whatever; even a Pentium 4 and a Core 2 Duo can differ) are not guaranteed to be identical - and such an assumption can ‘never’ EVER be made when using IEEE 754 floating point numbers.

It’s theoretically possible to get identical results, between identical devices (eg: Core2 Duo, exact same make/model) if you setup the floating point environment appropriately… but it would force you to a very specific set of hardware.

Are you using double-precision on the device?

Look for the -Xcompiler option in nvcc and your compiler documentation (http://msdn.microsoft.com/en-us/library/aa289157(VS.71).aspx for MSVC).

If you are compiling in 64-bit mode, your compiler probably already uses SSE instructions, which are not affected by the rounding control flag (so no 80-bit or 64-bit internal calculations).

Oh, I thought that IEEE-754 was created for the sole purpose of making this assumption hold. ;)

The problem is that compiler vendors thought it would be smart to disable IEEE-compliance by default, and require the user to enable it back through obscure compilation flags… So that they can claim both high performance and IEEE-compliance… though not at the same time.

Well, assuming hardware conforms to IEEE-754 to the letter, the assumption can only hold if the two devices use identically sized FP registers, AND the final sets of executed instructions on both devices are identical, in both the instructions themselves and their order.

Both conditions are rarely met unless you are dealing with same-architecture CPUs running the same binary, with no kernel intervention in the executed instructions.

Edit: To clarify, you are correct, that was the original intention of the spec - but it’s rarely true that the assumption holds due to compiler (as you said), software, and hardware differences.

Or equivalently “if the compiler is not broken”…

The only hardware issue I am aware of is the infamous x87 FPU and its deferred rounding and overflow detection (FMAs in PowerPC and Itanium are not a problem per se, you can just tell your compiler not to use it).

You are probably right that the problem is very common, because this “broken” instruction set is still the most widely-used…

Hopefully, migrating to 64-bit will force the adoption of the SSE and SSE2 instruction sets, which are comparatively clean and IEEE-compliant.

Actually, I found modern compilers comply fairly well. I had no arithmetic-related problems with gcc, icc and NVCC recently as long as I compiled in SSE mode, despite writing a lot of code that heavily depends on IEEE-754 compliance.

Maybe the main issue is the operations that are not covered by the standard, that is, everything except +, *, /, sqrt and conversions… So if you use any transcendental function like exp() or sin(), you need to provide your own libm for portability…

So the bottom line is: always compile with SSE and SSE2 enabled. Is anyone still using a pre-Pentium 4 or pre-Athlon 64 processor anyway?

I’m using single precision with just the basic +, *, /, sqrt operations in my code.
The program is compiled in 64-bit mode, so it uses SSE2 instructions.
Does this mean I should accept this as the expected difference?

My problem is that I use those values in further calculations, so the differences accumulate and sometimes end up producing unacceptable values.

Thank you very much for all your help.

OK. This is going to get ugly. ;)

Try with the following at the beginning of your program:

[codebox]#include <emmintrin.h>

// workaround for gcc bug 21408 and a similar MSVC header bug
#define MM_DENORMALS_ZERO_ON	0x0040

int main()
{
	// Enable flush-denormals-to-zero and denormals-are-zero modes (SSE2)
	_mm_setcsr(_mm_getcsr() | (_MM_FLUSH_ZERO_ON | MM_DENORMALS_ZERO_ON));

...
}[/codebox]

And in device code, replace every division by a call to __fdiv_rn and every sqrtf by __fsqrt_rn.

Ideally, you should also replace every + with a __fadd_rn and every * with __fmul_rn…

(Well. Ideally, all this should be controlled by compiler flags, not hairy underscored intrinsics…)

If you still get different results than emulation mode, then I will start to believe what Smokey says. :thumbup:

Thank you Sylvain, I did what you suggested, and I no longer get those weird values (e.g. 0.00000013) from very small numbers (e.g. 8.9406967e-008) that caused the accumulated-error problem.
I still can’t get exactly the same numbers, but the difference is now acceptable, much better than before.

Thanks for your help again.

So you could now experiment a bit to find out which division or square root is causing the problem (__fsqrt_rn and __fdiv_rn are very slow, so it’s better to use them only where they are really needed).

Your algorithm seems particularly sensitive to rounding errors (numerically unstable), so try to make sure that the answers returned are meaningful, even on the CPU side…