Can you show a compilable and buildable example? Did you use a higher precision reference computation to establish correctness? Since the expressions map well to FMA, which may guard against subtractive cancellation, this may be a case where GPU computation with FMA delivers a more accurate result than the equivalent computation without FMA. You can force the latter by specifying -fmad=false on the nvcc command line.
i am busy debugging; that should perhaps be kept in mind - everything seems to be relative when debugging
however:
for the same input values given to the host and to the device, the same code on the host and device does not give the same result, except when i break up the equation on the device, as noted above
i am (moderately) busy, and do not wish to pay too much attention to something that i can work around, right now
but, it took some time to establish this as a point of code departure, reminding me not to take anything for granted
this should be easily reproducible - the same equation in a test kernel would give either the correct or the wrong value; the compiler would either get the equation right or not
perhaps later i would establish whether i can reproduce the case in a separate test kernel
You seem to imply that unless the result from the GPU matches the result from the host computation, it is “incorrect”. I think it is quite possible that there is no issue with the compiler here, other than that it contracts the numerator and divisor expressions into a single FMA each, which in turn likely improves the accuracy. If so, the numerical difference you are seeing may well be justified, and your “workaround” would actually force a numerically inferior result.
I have analyzed numerous cases of alleged “incorrect” results from the GPU before, and would be happy to analyze the above expressions if supplied with real-life data for each of the operands (for double precision, you would want to print them with “% 23.16e” to capture the data unambiguously).
again, i am debugging, and i really need to confirm that i can reproduce this, via a test kernel
but, what strikes me is that, in order to get the original equation right, the compiler very likely needs to spawn or instigate 2 registers/ local variables, instead of just 1
i wonder as to its prowess…
no doubt in my mind that it can spawn a temporary working register, but 2…?
with regards to your remark on relative correctness:
the application solves a non-linear problem via coefficients
it is simpler/ cheaper to only store the coefficients, not the solution, as the former is shorter than the latter, and as the former expands into the latter
hence, when all is said and done, the host takes the coefficients of the final solution, and expands it
but when the host does that, it does not match the criteria the device used as part of its calculation
for example: the device pushes set x as the solution coefficient set, stating that its sum or some other criterion is y; but when the host takes the solution, and expands it in the same manner, it no longer gets y
i really think it is reasonable to expect the host and device to more or less yield the same result
and the error is significant
I wasn’t able to discover any problem or discrepancy in the first 10 decimal digits or so, between host and device computed results with a test case built around what you have shown in this posting. (Unless you are in fact looking for bit-wise identical mantissae between device and host. Even if you were looking for bit-wise identical mantissae, I doubt the code transformation you are proposing as a fix would have any bearing.) I think the problem is likely in something you haven’t shown.