Numerical differences between IPP and CUDA code when solving a linear system

I’m trying to solve a least-squares problem using a linear system of the form

  Ax = b  

where I have a diagonal weight matrix W, which changes my problem to the form:

 WAx = Wb
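
To make the setup concrete: with a diagonal W, forming WAx = Wb amounts to scaling row i of A and entry i of b by the weight w_i. A minimal host-side C++ sketch of that step follows. The function name, the row-major layout, and the in-place update are assumptions for illustration only, not the actual IPP or CUDA code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: scales row i of A (row-major, m x n) and entry i of b
// by the diagonal weight w[i], turning A x = b into W A x = W b in place.
void apply_diagonal_weights(std::vector<double>& A, std::vector<double>& b,
                            const std::vector<double>& w, std::size_t n) {
    const std::size_t m = w.size();
    assert(A.size() == m * n && b.size() == m);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            A[i * n + j] *= w[i];
        }
        b[i] *= w[i];
    }
}
```

On the GPU the same scaling would typically be done with one thread per matrix element, but the arithmetic per element is identical.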

I implemented this with both Intel IPP and CUDA (compute capability 2.0; NPP is not used). [I’m certain that there is no bug in the CUDA code.]

The IPP version works great and is extremely accurate, no doubt about it.

As for the CUDA code, it works great for small matrices. However, when I work with matrices of size n > 1e3, I get unacceptable numerical differences of about ±0.5, which may cause significant distortion later on.
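
An absolute difference of about 0.5 is hard to judge without the scale of the solution, so it may help to report it relative to the reference result. A minimal C++ sketch, assuming the IPP output is taken as the reference and both solutions have been copied into host vectors (the function name is hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest absolute difference between the two solutions,
// divided by the largest magnitude in the reference solution.
double max_relative_difference(const std::vector<double>& reference,
                               const std::vector<double>& candidate) {
    assert(reference.size() == candidate.size());
    double max_abs_diff = 0.0;
    double max_abs_ref = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        max_abs_diff = std::max(max_abs_diff, std::fabs(reference[i] - candidate[i]));
        max_abs_ref = std::max(max_abs_ref, std::fabs(reference[i]));
    }
    // Fall back to the absolute difference if the reference is all zeros.
    return max_abs_ref > 0.0 ? max_abs_diff / max_abs_ref : max_abs_diff;
}
```

For double precision, a value far above roughly 1e-16 times the condition number of the system would suggest something beyond ordinary rounding.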

I’d like to know how to avoid such numerical differences. How can IPP be so much more accurate?

I’m using double precision in both applications.

Are there any CUDA flags relevant to high-accuracy computations?

I’m using a lot of multiply-add operations. I tried disabling fused multiply-add with -fmad=false, but it didn’t change anything.
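
For context, the effect that the fmad setting controls can be reproduced on the host: a fused multiply-add rounds once, while a separate multiply followed by an add rounds twice. The C++ sketch below is only an illustration with hand-picked values, not the CUDA code in question. The function names are hypothetical, and `volatile` is used here as an assumption-free way to stop the host compiler from fusing the two operations itself.

```cpp
#include <cassert>
#include <cmath>

// Two roundings: the product is rounded to double (forced by volatile),
// then the sum is rounded again.
double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// One rounding: std::fma computes a * b + c exactly, then rounds once.
double fused_mul_add(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Hand-picked case where the two paths disagree:
// (1 + 2^-30) * (1 - 2^-30) = 1 - 2^-60 exactly, which rounds to 1.0.
void check_fma_example() {
    const double eps = std::ldexp(1.0, -30);
    assert(mul_then_add(1.0 + eps, 1.0 - eps, -1.0) == 0.0);
    assert(fused_mul_add(1.0 + eps, 1.0 - eps, -1.0) == -std::ldexp(1.0, -60));
}
```

The gap in this example is on the order of 1e-18, so differences of this kind are tiny per operation and would need heavy amplification to reach 0.5.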

** I can’t post the code.

Any help would be much appreciated.