The most likely cause for the difference between the two instruction sequences is MAD/FMA contraction, performed by the compiler as an optimization. You can check the generated machine code with cuobjdump to confirm or refute this hypothesis. Since both Open64 and PTXAS can perform this contraction, it is not sufficient to look only at the intermediate PTX.
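For example, you could disassemble the embedded machine code and look for FFMA (single precision, Fermi) or DFMA (double precision) instructions where the source has a separate multiply and add [my_app is a placeholder for your executable; exact option spelling may vary between toolkit versions]:

cuobjdump -sass my_app | grep -i fma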
Basically, I suspect that
mfl = vlpl * UAlr + vrml * UAcl
is translated into
mfl = {mad|fma} (vlpl, UAlr, vrml*UAcl)
If so, the product vlpl*UAlr is formed differently than the product vrml*UAcl, for both MAD and FMA, which explains why their sum is not exactly zero. With MAD [single precision only], the product vlpl*UAlr is truncated to single precision prior to the addition. With FMA [single precision on Fermi, and double precision], the product vlpl*UAlr is computed to twice the native precision and enters the addition unrounded in that format. By contrast, vrml*UAcl is rounded to native precision prior to the addition in either case.
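To see the effect in isolation, here is a minimal sketch [operand values are hypothetical; single-precision FMA requires a Fermi-class GPU] that uses the FMA intrinsic to recover the rounding error of a separately rounded product, which is exactly the residue left behind by the contraction:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void residual(float a, float b, float *r)
{
    // __fmul_rn(a, b) rounds the product a*b to single precision;
    // __fmaf_rn(a, b, c) computes a*b + c with a single rounding at the
    // end, so the result below is the rounding error of the multiply.
    *r = __fmaf_rn(a, b, -__fmul_rn(a, b));
}

int main(void)
{
    float h_r, *d_r;
    cudaMalloc(&d_r, sizeof(float));
    residual<<<1, 1>>>(1.1f, 3.3f, d_r);
    cudaMemcpy(&h_r, d_r, sizeof(float), cudaMemcpyDeviceToHost);
    printf("rounding error of a*b: %.9e\n", h_r); // generally nonzero
    cudaFree(d_r);
    return 0;
}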
You can use intrinsics to locally disable MAD/FMA contraction by rewriting the code as follows:
mfl = __fmul_rn(vlpl, UAlr) + __fmul_rn(vrml, UAcl); // single precision
mfl = __dmul_rn(vlpl, UAlr) + __dmul_rn(vrml, UAcl); // double precision
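As a minimal sketch [the kernel signature is hypothetical; only the expressions come from your code], the two formulations would compare like this:

__global__ void flux(float vlpl, float UAlr, float vrml, float UAcl,
                     float *contracted, float *uncontracted)
{
    // The compiler is free to contract this into a single MAD/FMA:
    *contracted   = vlpl * UAlr + vrml * UAcl;
    // The intrinsics force two separately rounded multiplies followed
    // by an ordinary add, matching the usual CPU evaluation:
    *uncontracted = __fmul_rn(vlpl, UAlr) + __fmul_rn(vrml, UAcl);
}

With vrml == -vlpl and UAcl == UAlr, the uncontracted result is exactly zero, while the contracted result is, in general, the small rounding residue discussed above.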
In general, comparing CPU and GPU results directly is not a good way of establishing whether the GPU results are correct (by some definition of correct). I would recommend comparing against a higher-precision reference to make that call. I use this approach extensively in my own work, and I often find that the GPU results, while different from the CPU results, are the more accurate ones.
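As a minimal sketch of that methodology [operand values are hypothetical; fmaf() is used on the host to mimic the contracted GPU evaluation], compute a double-precision reference and measure each single-precision result against it:

#include <math.h>
#include <stdio.h>

// Relative error of a single-precision result against a double reference.
static double rel_error(float result, double reference)
{
    return fabs(((double)result - reference) / reference);
}

int main(void)
{
    float vlpl = 1.1f, UAlr = 3.3f, vrml = 2.2f, UAcl = 4.4f;
    // Products of floats are exact in double, so this is a good reference.
    double ref = (double)vlpl * UAlr + (double)vrml * UAcl;
    // Typical CPU evaluation: two rounded multiplies, then a rounded add.
    float cpu = vlpl * UAlr + vrml * UAcl;
    // Mimics the contracted GPU evaluation: one rounded multiply, then
    // fmaf() performs the remaining multiply-add with a single rounding.
    float gpu = fmaf(vlpl, UAlr, vrml * UAcl);
    printf("cpu error = %.3e   gpu error = %.3e\n",
           rel_error(cpu, ref), rel_error(gpu, ref));
    return 0;
}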
NVIDIA has a new whitepaper out that addresses some of the discrepancies in floating-point computation between CPUs and GPUs, which you may find useful. The author welcomes feedback. Get it here:
http://developer.nvidia.com/content/everything-you-ever-wanted-know-about-floating-point-were-afraid-ask