I have a recursive floating point calculation inside a kernel that is producing wrong results. It is the same code as the CPU version, but the results differ by a relative error of up to 1e-3, which is way too much and produces a lot of noise in the output.
Any idea what may be causing this? I suspect mixed integer/float operation issues, but I'm not sure.
Fast math is not enabled.
Here is one of the functions showing the problem (the shorter one). Each thread has its own global memory buffer for the calculation, hence the use of step here:
The obvious candidate in that code is the cosine. The CUDA programming guide has range and ULP error data for all of the CUDA math library functions; you should check those figures (for any math library function you are using, for that matter) against your input data. You should also check whether your host code is actually using the single or double precision versions of the same functions. Your host code may well be doing intermediate calculations in double precision without you realising it.
I got the error down: where the host returns -1181.809082 for the last element in the chain, the device initially returned -1181.751343 and now returns -1181.795654, which is closer to the target but still not there, and nothing else I tried got it any closer.
Surprisingly, trying to attach the minus to the n, which works fine on the host, i.e.
Yeah, I would guess that is a fused multiply-add contraction issue. I would suggest explicitly casting the integer to a float outside of the expression (in both codes), and then using the functions Sylvian suggested to force the compiler to issue separate multiply and add operations.