accuracy of fp division


I’m running into a problem that seems to show that the accuracy of single-precision floating-point
division on CUDA is lower than on the CPU…

Is this really true? And if so, are there any ways to deal with it?


Yes, this is documented in the programming guide (p. 86): single-precision division is accurate to only 2 ulp (because the GPU implements division via the reciprocal).

One workaround is to use double precision (if your GPU supports it), which is fully IEEE-compliant.

This could explain why my GPU GMRES solver needs more iterations to converge than the CPU version. Loss of orthogonality due to lower precision…

I have no idea how to test the hypothesis, though…

Maybe compare to emu mode?

OK, thanks for the reply. I have a G200, but double-precision division should be really slow since it lacks dedicated hardware support,

so I have to look for other options…


What would be the other workarounds besides doubles?



One step of Newton iteration might polish the final bits. This is certainly possible for computing 1/Z, but I imagine it could be extended to Y/Z.

The iteration

x2 = x1*(2.0 - Z*x1)

polishes the estimate x1, converging to 1.0/Z.

FYI, in CUDA 2.2 we are planning on adding some new device functions that will provide IEEE-compliant single-precision reciprocal, square root, and division.

Note that these will be much slower than the built-in operations, but may be useful for developers who need to match CPU results exactly.

Hi Simon,

Thanks for the information. Can you suggest a workaround for sqrt accuracy in the meantime?

I would try using some kind of iterative method to improve the accuracy of sqrt, as SPWorley suggested:…ng_square_roots