Can you show a compilable and buildable example? Did you use a higher precision reference computation to establish correctness? Since the expressions map well to FMA, which may guard against subtractive cancellation, this may be a case where GPU computation with FMA delivers a more accurate result than the equivalent computation without FMA. You can force the latter by specifying -fmad=false on the nvcc command line.
i am busy debugging; that should perhaps be kept in mind - everything seems to be relative when debugging
however:
for the same input values given to the host and to the device, the same code on the host and device does not give the same result, except when i break up the equation on the device, as noted above
i am (moderately) busy, and do not wish to pay too much attention to something that i can work around, right now
but, it took some time to establish this as a point of code departure, reminding me not to take anything for granted
this should be easily reproducible - the same equation in a test kernel would give either the correct or the wrong value; the compiler would either get the equation right or not
perhaps later i would establish whether i can reproduce the case in a separate test kernel
You seem to imply that unless the result from the GPU matches the result from the host computation, it is “incorrect”. I think it is quite possible that there is no issue with the compiler here, other than that it contracts the numerator and divisor expressions into a single FMA each, which in turn likely improves the accuracy. If so, the numerical difference you are seeing may well be justified, and your “workaround” would actually force a numerically inferior result.
I have analyzed numerous cases of alleged “incorrect” results from the GPU before, and would be happy to analyze the above expressions if supplied with real-life data for each of the operands (for double precision, you would want to print them with “% 23.16e” to capture the data unambiguously).
again, i am debugging, and i really need to confirm that i can reproduce this, via a test kernel
but, what strikes me is that, in order to get the original equation right, the compiler very likely needs to spawn or instigate 2 registers/ local variables, instead of just 1
i wonder as to its prowess…
no doubt in my mind that it can spawn a temporary working register, but 2…?
with regards to your remark on relative correctness:
the application solves a non-linear problem via coefficients
it is simpler/ cheaper to only store the coefficients, not the solution, as the former is shorter than the latter, and as the former expands into the latter
hence, when all is said and done, the host takes the coefficients of the final solution, and expands it
but when the host does that, it does not match the criteria the device used as part of its calculation
for example: the device pushes set x as the solution coefficient set, stating that its sum or some other criterion is y; but when the host takes the solution, and expands it in the same manner, it no longer gets y
i really think it is reasonable to expect the host and device to more or less yield the same result
and the error is significant
I wasn’t able to discover any problem or discrepancy in the first 10 decimal digits or so, between host and device computed results with a test case built around what you have shown in this posting. (Unless you are in fact looking for bit-wise identical mantissae between device and host. Even if you were looking for bit-wise identical mantissae, I doubt the code transformation you are proposing as a fix would have any bearing.) I think the problem is likely in something you haven’t shown.