floating point processor of GPUs

Hi all, I know that recent generations of GPUs are IEEE-compliant and support double precision. However, I still wonder about the details of the floating point units in GPUs. The FPU in an x86 processor may use an underlying 80-bit extended format for floating point numbers. I cannot find the technical details of the FPU on GPUs: does it also implement double precision on top of an 80-bit representation?

This problem comes from my recent work. In some cases (although the probability is very, very low), the floating point arithmetic operations on the GPU behave differently from those on the CPU. On the CPU, I have forced double precision to use 64 bits rather than 80 bits.
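For reference, the usual ways to force 64-bit rounding on x86 are either to compile for SSE2 math (e.g. gcc -mfpmath=sse) or to lower the x87 precision-control field from extended to double precision. A minimal sketch of the latter, using glibc's <fpu_control.h> (glibc-specific names, shown only for illustration):

#include <fpu_control.h>   /* glibc-specific access to the x87 control word */

/* Drop the x87 precision-control field from 80-bit extended to 64-bit
 * double, so intermediate results are rounded the same way as on the GPU.
 * Note: this affects only the legacy x87 unit; SSE2 arithmetic already
 * rounds to 64 bits. */
static void force_x87_double_precision(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
    _FPU_SETCW(cw);
}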

Additionally, I also think this problem is related to the compiler. My program does not show this problem when built with CUDA 2.0, but does with all later CUDA versions.

I have not figured out a way to isolate this problem from my complicated program, so I am sorry for the long and boring text…

GPUs have a fused multiply-add (FMA) instruction in double precision, which is more accurate than a multiplication followed by an addition because the intermediate product is not rounded. The compiler can replace (a * b) + c sequences with FMAs, which can return a different answer.
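For comparison purposes, contraction can be avoided on the device by using explicitly rounded intrinsics instead of the plain expression. A minimal sketch (kernel name and arguments are only illustrative):

// The compiler may contract the commented line into a single FMA, which
// rounds only once; the intrinsics below force a separately rounded
// multiply and add, matching a CPU that does not fuse them.
__global__ void madKernel(const double *a, const double *b,
                          const double *c, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // out[i] = a[i] * b[i] + c[i];                    // may become one FMA
        out[i] = __dadd_rn(__dmul_rn(a[i], b[i]), c[i]);   // two roundings
    }
}

(Newer toolkits also have an nvcc option, --fmad=false, to disable this contraction globally, if I remember correctly.)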

There is no such thing as extended 80-bit precision on GPUs.

Appendix G.2 of the CUDA Programming Guide has all the details about IEEE compliance.

Thanks for your answers. I know about FMAs, and I have already avoided them on both the GPU and the CPU, so I do not think that is the problem in my program. Let me try to simplify my program and post some code later.

If you have looked into it as far as you have, I’m sure you know that the double precision floating point is not 100% IEEE-compliant, as I believe it is required to be on the CPU. Over many iterations of CG in double precision I also get differing results, though I have not yet found an example that completely confuses the GPU numerically. I’d bet they exist. The new architectures should be much better, but I don’t believe even Fermi is 100% IEEE-compliant.

Actually, IEEE 754-2008 support on Fermi is already much better than on most CPUs, especially x86 (even with SSE).

Fermi supports FMA in single and double precision and has hardware support for subnormals. Current x86 processors don’t.

Fermi still lacks support for the IEEE flags and exceptions, however.

The architectures that I think offer “better” floating point support are AMD’s Evergreen architecture (another GPU, strangely enough!), which supports unfused multiply-add and lots of other fused or unfused operations in addition to FMA, as well as the IEEE flags, and IBM’s POWER6/POWER7, which support decimal floating point and lots of exotic rounding modes.

I think that few, if any, CPUs are fully IEEE compliant either. Poking through compiler man pages usually turns up an option which states something like “Enforces full IEEE compliance on all operations. Will make code glacially slow.” AIUI, getting pretty good IEEE compliance in hardware is (relatively) easy, but sorting out all the nooks and crannies of the standard is quite difficult and costly. Since these nooks and crannies don’t affect most code, hardware designers don’t bother, leaving it to the compiler writers to sort out in software.

Operations like addition and multiplication, which are associative in exact arithmetic, are not associative in floating point arithmetic on computers.
Because precision is finite and rounding happens on every operation, the order in which operations are performed changes the result.

Parallel decomposition changes the order of operations and hence the results. Even normal reductions performed on a large data set can give widely varying answers.
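A tiny host-side example (just a sketch; it compiles as plain C/C++ or as a .cu file) showing that regrouping the same three operands changes the rounded result:

#include <cstdio>

int main()
{
    double a = 1.0e16, b = -1.0e16, c = 1.0;

    double left  = (a + b) + c;   // 0.0 + 1.0 -> 1.0
    double right = a + (b + c);   // b + c rounds back to -1.0e16 -> 0.0

    printf("(a+b)+c = %g,  a+(b+c) = %g\n", left, right);
    return 0;
}

A parallel reduction is effectively a different grouping of the same sum, so small differences from a sequential CPU loop are expected rather than a sign of a bug.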

Does anyone know of a CUDA library for decimal floating-point computations?