long live the compiler

the compiler seems to struggle with:

dbl1 = (dbl2 - (dbl1 * dbl3)) / (1 - (dbl1 * s_dblx[2]));

i have to break it down into 2 parts and introduce an additional variable to obtain the correct result

dbl4 = 1 / (1 - (dbl1 * s_dblx[2]));
dbl1 = (dbl2 - (dbl1 * dbl3)) * dbl4;
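
One thing to keep in mind when comparing the two forms: they are not the same arithmetic. The single expression rounds the quotient once, while the split version rounds the reciprocal and then rounds the product again. A minimal host-side C++ sketch of that difference follows; the function names are hypothetical and this is not the device code from this thread.

```cpp
#include <cmath>

// Direct form: one division, so the quotient is rounded once.
double divide_direct(double num, double den) {
    return num / den;
}

// Split form, mirroring the dbl4 workaround: the reciprocal is rounded
// to double first, then the product is rounded a second time.
double divide_via_reciprocal(double num, double den) {
    double recip = 1.0 / den;  // plays the role of dbl4
    return num * recip;
}
```

With IEEE 754 double arithmetic, divide_direct(49.0, 49.0) is exactly 1.0, whereas divide_via_reciprocal(49.0, 49.0) comes out one unit in the last place below 1.0. Differences of that size are expected between the two forms and do not by themselves indicate a compiler problem.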

cuda 6.5
the above is device code; i suppose it is a bit much for the compiler to bite properly
i now conclude the compiler has infinite wisdom

i suspect that it may be because the destination is also a source more than once

Can you show a compilable and buildable example? Did you use a higher precision reference computation to establish correctness? Since the expressions map well to FMA, which may guard against subtractive cancellation, this may be a case where GPU computation with FMA delivers a more accurate result than the equivalent computation without FMA. You can force the latter by specifying -fmad=false on the nvcc command line.
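
To illustrate the FMA point, here is a rough host-side C++ sketch of the same expression evaluated with and without fused multiply-add. It assumes that contraction turns the numerator and the divisor into one FMA each; the function names and the parameter sx (standing in for s_dblx[2]) are made up for the illustration. The volatile temporaries are only there to stop the host compiler from fusing the operations on its own.

```cpp
#include <cmath>

// Unfused: each product is rounded to double before the subtraction.
// The volatile temporaries force the rounded products to be materialized,
// so the host compiler cannot contract them into an FMA.
double eval_unfused(double dbl1, double dbl2, double dbl3, double sx) {
    volatile double p_num = dbl1 * dbl3;
    volatile double p_den = dbl1 * sx;
    return (dbl2 - p_num) / (1.0 - p_den);
}

// Fused: std::fma(a, b, c) computes a*b + c with a single rounding,
// which is what contracting each sub-expression into one FMA amounts to.
double eval_fused(double dbl1, double dbl2, double dbl3, double sx) {
    double num = std::fma(-dbl1, dbl3, dbl2);  // dbl2 - dbl1*dbl3
    double den = std::fma(-dbl1, sx, 1.0);     // 1 - dbl1*sx
    return num / den;
}
```

For inputs where dbl2 nearly cancels dbl1 * dbl3, for example dbl1 = dbl3 = 1 + 2^-27 and dbl2 = 1 + 2^-26, the unfused numerator collapses to zero while the fused one keeps the small residual of -2^-54. In a case like that, the fused result is the more accurate one.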

i am busy debugging; that should perhaps be kept in mind - everything seems to be relative when debugging

for the same input values given to the host, and to the device, the same code on the host and device does not give the same result, except when i break up the equation on the device, as noted above

i am (moderately) busy, and do not wish to pay too much attention to something that i can work around, right now
but, it took some time to establish this as a point of code departure, reminding me not to take anything for granted
this should be easily reproducible - the same equation in a test kernel would either give the correct or wrong value; either the compiler would get the equation right or wrong
perhaps later i will establish whether i can reproduce the case in a separate test kernel

I am a little worried about the use of integer constants in either version.
1.0 would seem safer to me.
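
For reference, a small C++ sketch of when an integer literal matters and when it does not. In a mixed int/double expression the usual arithmetic conversions promote the int to double, so the literal 1 in the expressions above behaves like 1.0; the risk only appears when both operands of an operator are integers. The function names are hypothetical.

```cpp
// Mixed int/double: the int literal 1 is converted to double before the
// arithmetic, so both functions compute the same value.
double recip_int_literal(double x) { return 1 / (1 - x); }
double recip_double_literal(double x) { return 1.0 / (1.0 - x); }

// Both operands int: the division truncates to 0 before the conversion.
double half_int() { return 1 / 2; }

// One double operand is enough to get floating-point division.
double half_double() { return 1.0 / 2; }
```

Writing 1.0 remains the safer habit, since it keeps working if the other operand is later changed to an integer type.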

dbl1 to dbl4 are local variables of type double
s_dblx is shared memory of type double

why double?
because some things just present better when packaged together…

You seem to be implying that unless the result from the GPU matches the result from the host computation, it is “incorrect”. I think it is quite possible that there is no issue with the compiler here, other than that it contracts the numerator and divisor expressions into a single FMA each, which in turn likely improves the accuracy. If so, the numerical difference you are seeing may be well justified and your “workaround” would actually force a numerically inferior result.

I have analyzed numerous cases of alleged “incorrect” results from the GPU before, and would be happy to analyze the above expressions if supplied with real-life data for each of the operands (for double precision, you would want to print them with “% 23.16e” to capture the data unambiguously).
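
As a sketch of why that format is sufficient: a precision of 16 in %e prints 17 significant digits, which is enough to read an IEEE 754 double back to the identical value. The helper names below are hypothetical.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Format one operand with the "% 23.16e" format suggested above:
// space flag for the sign position, width 23, 16 digits after the point.
std::string format_operand(double v) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "% 23.16e", v);
    return std::string(buf);
}

// Parse the printed text back into a double; strtod skips the leading space.
double parse_operand(const std::string& s) {
    return std::strtod(s.c_str(), nullptr);
}
```

Printing each operand this way on both host and device, then feeding the text back in, should reproduce the inputs exactly, assuming the C library parses decimal strings with correct rounding.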

again, i am debugging, and i really need to confirm that i can reproduce this, via a test kernel

but, what strikes me is that, in order to get the original equation right, the compiler very likely needs to spawn or instigate 2 registers/ local variables, instead of just 1
i wonder as to its prowess…
no doubt in my mind that it can spawn a temporary working register, but 2…?

with regards to your remark on relative correctness:
the application solves a non-linear problem via coefficients
it is simpler/ cheaper to only store the coefficients, not the solution, as the former is shorter than the latter, and as the former expands into the latter
hence, when all is said and done, the host takes the coefficients of the final solution, and expands them
but when the host does that, it does not match the criteria the device used as part of its calculation
for example: the device pushes set x as the solution coefficient set, stating that its sum or some other criterion is y; but when the host takes the solution, and expands it in the same manner, it no longer gets y
i really think it is reasonable to expect the host and device to more or less yield the same result
and the error is significant

I wasn’t able to discover any problem or discrepancy in the first 10 decimal digits or so, between host and device computed results with a test case built around what you have shown in this posting. (Unless you are in fact looking for bit-wise identical mantissae between device and host. Even if you were looking for bit-wise identical mantissae, I doubt the code transformation you are proposing as a fix would have any bearing.) I think the problem is likely in something you haven’t shown.
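
A comparison of that kind can be sketched as follows. This is not the test case referred to above, only a hypothetical C++ helper that checks whether a host result and a device result agree to roughly a given number of decimal digits, rather than bit for bit.

```cpp
#include <cmath>

// Relative difference, scaled by the larger magnitude; 0 when both are equal.
double relative_difference(double host, double device) {
    if (host == device) return 0.0;
    double scale = std::fmax(std::fabs(host), std::fabs(device));
    return std::fabs(host - device) / scale;
}

// True when the two results agree to roughly `digits` decimal digits.
bool agree_to_digits(double host, double device, int digits) {
    return relative_difference(host, device) <= std::pow(10.0, -digits);
}
```

A check such as agree_to_digits(host_result, device_result, 10) tolerates the last-bit differences that FMA contraction can introduce, while still flagging a significant error.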

i have by now more or less finished debugging, and have now written a test kernel

i cannot reproduce the case; hence it must have been a case of post-debugging hysteria caused by conventional debugging relativity

indeed, long live the compiler