wrong floating point operation results

I have a recursive floating point calculation inside a kernel that is producing wrong results. It is the same code as the CPU version, but the results differ by a relative error of up to 1e-3, which is far too much and produces a lot of noise in the output.

Any idea what may be causing this? I suspect mixed integer/float operation issues, but I'm not sure.
Fast math is not enabled.

One of the functions showing the problem (the shorter one). Each thread has its own global memory buffer for the calculation, hence the use of step here:

__device__ void genlgp(float theta, int nc, float *pnmllg, size_t step)
{
float costh = cosf(theta);

pnmllg[0] = 0.0f;
pnmllg[step] = 1.0f;

for (int n = 2 ; n < nc ; n++)
	pnmllg[n*step] = ((2.0f*n - 1.0f)*costh*pnmllg[(n - 1)*step] - n*pnmllg[(n - 2)*step])/(n - 1.0f);

}

Glad for any suggestions

thanks

The obvious candidate in that code is the cosine. There is range and ULP data for all the CUDA math library functions in the programming guide. You should check those (and any other math library functions you are using for that matter) against your input data. You should also check whether your host code is actually using single or double precision versions of the same functions. Your host code may well be doing intermediate calculations in double precision, and you just don’t realise it.

moved this to be a reply, sorry … :">

I did some more testing and it's not cosf (which returns a bit-identical result), it's the floating point operations themselves.

Found this in the documentation:

  • Division is implemented via the reciprocal in a non-standard-compliant way;

  • Square root is implemented via the reciprocal square root in a non-standard-compliant way.

Changing the device code from

pnmllg[n*step] = ((2.0f*n - 1.0f)*costh*pnmllg[(n - 1)*step] + (-n)*pnmllg[(n - 2)*step])/(n - 1.0f);

to this

pnmllg[n*step] = __fdiv_rn((2.0f*__int2float_rn(n) - 1.0f)*costh*pnmllg[(n - 1)*step] - n*pnmllg[(n - 2)*step], __int2float_rn(n - 1));

got the error down. Where the host returns -1181.809082 for the last element in the chain, the device initially returned -1181.751343 and now returns -1181.795654, which is closer to the target but still not there. Nothing else I tried got it any closer.

Surprisingly, trying to attach the minus to the n, which works fine on the host, i.e.

pnmllg[n*step] = __fdiv_rn((2.0f*__int2float_rn(n) - 1.0f)*costh*pnmllg[(n - 1)*step] + (-n)*pnmllg[(n - 2)*step], __int2float_rn(n - 1));

got an amazingly wrong result of -1181.429077

still looking for that last non-IEEE-compliant bit that will take me the rest of the way, though.

As far as I can tell the problem is that

a*b*c or a*b*c - d

evaluates differently between the CPU and GPU for some values. Judging from the results, that happens for quite a few values.

Also, -n*a (where n is int and a is float) and (-n)*a produce significantly different results, again in quite a few instances.

Maybe truncated multiply-and-add?

What happens if you replace every (or some) multiplication by __fmul_rn and every addition by __fadd_rn?

(I mean, aside from making the code completely unreadable ;))

Yeah I would guess that is a multiply-add combination issue. I would suggest explicitly casting the integer to a float outside of the expression (in both codes), and then using the functions Sylvian suggested to force the compiler to issue separate multiply and add operations.