the Lost of Accuracy using Cublas


I am converting my code from CPU to GPU. The original CPU code has been tested and gives correct results. However, I found that, because of the precision limitation (SP) of GPU, my codes fails in the GPU environment. The algorithm that I am working on is consisting some of the basic Blas 1 and Blas 2 operations such as: matrix vector vector, saxpy, norm_2 etc. The library that I am using is cublas, and the GPU card that I am running on is Tesla C-1060. Could someone here give me some points on how to compensate for the lost of the accuracy under GPU with SP precision?

I’d appreciate for your help!

In my experience, straight arithmetic operations and linear algebra can actually be a little more accurate on the GPU than standard IEEE single precision because it GPU can often issue combined multiply-add instructions which not suffer from intermediate truncation or rounding. The results from standard IA32 and (to a lesser extent) x86_64 single precision are pretty poor points of comparison, because single precision floating point is actually done in higher precision (64 or 80 bit) and the results rounded off to single precision.

If you have single precision code which works on your host CPU and fails on the GPU, it probably means your algorithm isn’t truly safe in single precision (or you have bugs). Your C1060 can do double precision too. You might want to investigate moving the critical sections of your code into double precision on the GPU, or alternatively adding a double precision “polishing” stage after initially approximating the solution/stage value in single precision.

Hi avidday,

Thanks for the replying!

So my original code on host CPU is in double precision,

Your “polishing” idea sounds interesting! Could you be a bit more specific or do you happen to have some reference to this? In my case, the critical section of my algorithm is actually several matrix vector multiplications (Mat-Vec), which is also the most expensive part of my algorithm. That’s why I am trying to convert those Mat-Vec’s from DP to SP to get a even faster speed-up results using C1060.

I am browsing some threads in the forum talking about Kahan summation algorithm, which I have zero experience on that. Do you think that would be applicable to compensate the loss of accuracy for Mat-Vec multiplication?


x87 based CPU math uses 80-bits internally… (much higher than what iEEE754 mandates).
SSE code uses only 64-bits as per IEEE754.
Take that into consideration… There r compiler options to force SSE. Check out

Best Regards,

That is a rather different proposition to what I thought you were asking about. Converting a double precision algorithm to single precision without a loss of accuracy is something akin to perpetual motion. The only real question is how much accuracy loss can your algorithm can safely tolerate.

This paper is a pretty reasonable starting point. It talks about doing the Linpack benchmark in mixed precision on the Cell processor.

It might help in some cases, depending on the relative magnitude of the components of the sum. It is also more than twice as slow as simple arithmetic.

Hi Sarnath,

Good point! Thanks for the reminder. That might also explain why my original DP algorithm gives way better accuracy. I am assuming x87 is independent of programming languages, right? Since my DP codes wasn’t compiled in C, it was Matlab script and compiled in Matlab.

Hi avidday


What a coincidence! Similar to the paper that you posted, I am also working on linear solvers, but instead of dense systems as illustrated as examples in the paper, I am more interested in large-scale systems and Krylov iterative solvers. The iterative refinement is absolutely doable, I have already implemented it and tested it in SP. The solution is completely comparable to solutions using pure Krylov solvers in DP! So the accuracy of the solution is very well preserved, however, the accuracy of all the intermediate results (such as the conjugate directions as in CG algorithm) are still far off from DP results unfortunately.

I am not sure if I understand this refinement algorithm completely, but I think it is essentially sort of Newton’s iteration for finding solution, which converges quadratically in most cases given a decent initial guess. So that might explain why this idea work so well here for the preservation of the accuracy of the solution.

It is an very interesting idea though, at least it gives you a way to get a good solution! But I think, I still need to find other ways for those intermediate results.

Thanks a lot for the suggestion!