wrong floating point operation results

I have a recursive floating point calculation inside a kernel that is producing wrong results. It is the same code as the CPU version, but the results differ by a relative error of up to 1e-3, which is far too much and produces a lot of noise in the output.

Any idea what may be causing this? I suspect mixed integer/float operation issues, but I'm not sure.
Fast math is not enabled.

One of the functions showing the problem (the shorter one). Each thread has its own global memory buffer for the calculation, hence the use of step here:

__device__ void genlgp(float theta, int nc, float *pnmllg, size_t step)
{
float costh = cosf(theta);

pnmllg[0] = 0.0f;
pnmllg[step] = 1.0f;

for (int n = 2 ; n < nc ; n++)
	pnmllg[n*step] = ((2.0f*n - 1.0f)*costh*pnmllg[(n - 1)*step] - n*pnmllg[(n - 2)*step])/(n - 1.0f);

}

Glad for any suggestions

thanks

The obvious candidate in that code is the cosine. There is range and ULP data for all the CUDA math library functions in the programming guide. You should check those (and any other math library functions you are using for that matter) against your input data. You should also check whether your host code is actually using single or double precision versions of the same functions. Your host code may well be doing intermediate calculations in double precision, and you just don’t realise it.

moved this to be a reply, sorry … :">

I did some more testing and it's not cosf (which returns a bit-identical result), it's the floating point operations themselves.

Found this in the documentation:

  • Division is implemented via the reciprocal in a non-standard-compliant way;

  • Square root is implemented via the reciprocal square root in a non-standard-compliant way.

Changing the device code from

pnmllg[n*step] = ((2.0f*n - 1.0f)*costh*pnmllg[(n - 1)*step] + (-n)*pnmllg[(n - 2)*step])/(n - 1.0f);

to this

pnmllg[n*step] = __fdiv_rn((2.0f*__int2float_rn(n) - 1.0f)*costh*pnmllg[(n - 1)*step] - n*pnmllg[(n - 2)*step], __int2float_rn(n - 1));

got the error down. Where the host returns -1181.809082 for the last element in the chain, the device initially returned -1181.751343 and now returns -1181.795654, which is closer to the target but still not there. Nothing else I tried got it any closer.

Surprisingly, trying to attach the minus to the n, which works fine on the host, i.e.

pnmllg[n*step] = __fdiv_rn((2.0f*__int2float_rn(n) - 1.0f)*costh*pnmllg[(n - 1)*step] + (-n)*pnmllg[(n - 2)*step], __int2float_rn(n - 1));

got an amazingly wrong result of -1181.429077

still looking for that last non-IEEE-compliant bit that will take me the rest of the way, though.

As far as I can tell the problem is that

a*b*c or a*b*c - d

evaluates differently between the CPU and GPU for some values. Judging from the results, that happens for quite a few values.

Also, -n*a (where n is int and a is float) and (-n)*a produce significantly different results, again in quite a few instances.

Maybe truncated multiply-and-add?

What happens if you replace every (or some) multiplication by __fmul_rn and every addition by __fadd_rn?

(I mean, aside from making the code completely unreadable ;))

Yeah I would guess that is a multiply-add combination issue. I would suggest explicitly casting the integer to a float outside of the expression (in both codes), and then using the functions Sylvian suggested to force the compiler to issue separate multiply and add operations.