CPU and GPU floating-point anomaly

Hi friends

We have a CPU C++ program giving floating-point output A.
A CUDA C++ port of the CPU program was developed on a Tesla C1060 GPU, giving floating-point output B.

We are getting outputs A and B matching to only one digit after the decimal point.

Please tell us what we should do to get outputs A and B to match to more than 6 digits after the decimal point.

regards
Team

What is the precision of the GPU code?

double precision for GPU and CPU code

How many significant digits are in front of the decimal point? How many of these match?

You can get accumulating error when you add (or apply some other operation to) two numbers a and b where the ratio of the smaller to the larger is of the same order as the machine precision. But in double precision you would need roughly a billion operations for that much error to accumulate.
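For example (a minimal sketch with made-up numbers, not from the code in question): when b/a is close to the double-precision epsilon, every addition of b to a silently drops b, and the loss accumulates over the whole loop:

#include <cstdio>

int main(void)
{
    double a = 1.0;
    const double b = 1.0e-16;          // just below half an ulp of 1.0, so a + b rounds back to a
    const long   n = 100000000L;       // 1e8 additions
    for (long i = 0; i < n; ++i)
        a += b;                        // each step loses b entirely
    printf("computed: %.17g\n", a);                     // stays at 1
    printf("expected: %.17g\n", 1.0 + b * (double)n);   // about 1.00000001
    return 0;
}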

I would recommend reading the following whitepaper, if you haven’t had the chance to do so:

[url]http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf[/url]

Not knowing anything about the code other than that it is double-precision code, the most likely cause for numerical discrepancies between CPU and GPU would be the merging of double-precision multiplication and addition into double-precision FMA (fused multiply add). You can turn that off by passing -fmad=false to nvcc, but this will likely reduce the accuracy and performance of the GPU code. Generally speaking, the use of FMA typically improves accuracy by reducing rounding and providing some protection from subtractive cancellation.
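As an illustration of the FMA point (a minimal host-side sketch, not taken from the code under discussion; x is chosen purely so the difference is visible): x*x - 1 computed with a separately rounded product loses the low-order part of the product that fma() retains:

#include <cmath>
#include <cstdio>

int main(void)
{
    double x = 1.0 + std::ldexp(1.0, -27);    // 1 + 2^-27
    volatile double p = x * x;                // product rounded on its own; volatile blocks contraction
    double separate = p - 1.0;                // multiply-round, then subtract-round
    double fused    = std::fma(x, x, -1.0);   // single rounding at the very end
    printf("separate mul+sub: %.17g\n", separate);
    printf("fused (FMA)     : %.17g\n", fused);
    return 0;
}

Compiling the device code with -fmad=false forces the GPU onto the "separate" path, which is why results can move toward the CPU numbers while actually becoming less accurate.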

I am not exactly sure what you mean by “matching to within one digit after the decimal point”. Can you show an example pair of results? How many digits are there altogether, and how many match?

Depending how big the numerical differences are, they could also be due to a bug in the code. Other than a careful review of the code, make sure that the code checks the status of all CUDA API calls and kernel launches, and run the program under cuda-memcheck. Please be aware that on an sm_13 device it will be able to provide only very limited checking due to hardware limitations.
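As a sketch of the kind of checking meant here (CUDA_CHECK and my_kernel are placeholder names, not from the program in question), every runtime API call and every kernel launch gets wrapped:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

__global__ void my_kernel(double *a) { a[threadIdx.x] *= 2.0; }

int main(void)
{
    double *d_a = NULL;
    CUDA_CHECK(cudaMalloc(&d_a, 32 * sizeof(double)));
    CUDA_CHECK(cudaMemset(d_a, 0, 32 * sizeof(double)));
    my_kernel<<<1, 32>>>(d_a);
    CUDA_CHECK(cudaGetLastError());        // launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());   // errors raised while the kernel runs
    CUDA_CHECK(cudaFree(d_a));
    return 0;
}

Running the resulting binary under cuda-memcheck then flags out-of-bounds or misaligned accesses that can silently corrupt results.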

Hello,

In this blog:
https://developer.nvidia.com/content/everything-you-ever-wanted-know-about-floating-point-were-afraid-ask
there is a link to an article describing how numbers are stored on the GPU and how you can estimate the rounding error of a single operation, though if rounding were the problem you would need many operations to reach such a big difference.

In one of my codes there was a difference between the GPU and CPU numbers; after many hours spent on 'fixing' the GPU part, it turned out that the CPU part had the mistake.

Another problem I had in the same code was the addition of two numbers (x position + size of system): -0.000xxx + 650.xxxx always gave 650.xxxx on the GPU. This is in fact a precision problem, since the small number divided by the large number is approximately the same as the precision of single-precision floats. We fixed this by shifting the box by half its size, so that we always add or subtract numbers of similar magnitude and the precision loss is minimal.
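To make that concrete, here is a tiny sketch with made-up stand-ins for the elided digits (the exact values are not shown in the post):

#include <cstdio>

int main(void)
{
    float big   = 650.1234f;     // stand-in for 650.xxxx
    float small = -0.000173f;    // stand-in for -0.000xxx

    // ulp(650.1234f) is about 6e-5, the same order as 'small', so most of
    // the information in 'small' is lost when it is added to 'big'.
    float  f_sum = big + small;
    double d_sum = (double)big + (double)small;   // higher-precision reference
    printf("float sum : %.10f\n", f_sum);
    printf("double sum: %.10f\n", d_sum);

    // Operands of similar magnitude lose almost nothing by comparison.
    printf("similar-magnitude sum: %.10f\n", 0.1234f + small);
    return 0;
}

Shifting the coordinates so they are measured from the centre of the box, as described above, keeps the operands of comparable size and preserves more of the small displacement.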

njuffa

GPU output
1.9148892009282

CPU output
1.9023344567543

The CUDA program uses only double-precision code, and the serial CPU code likewise uses only double-precision data.

How am I to resolve the anomaly in the above outputs?

If the code has operations like

a += b*(c+d)+r1*(h+4)*u+y*(u+5)

My program in C++ has the following sample code:

for(eln=0;eln<12;eln++)
{
    vol[eln]*delt*0.5*(uu*b1[eln]*b1[eln]+uv*b1[eln]*c1[eln]+uw*b1[eln]*d1[eln]+
                       uv*c1[eln]*b1[eln]+vv*c1[eln]*c1[eln]+vw*c1[eln]*d1[eln]+
                       uw*d1[eln]*b1[eln]+vw*d1[eln]*c1[eln]+ww*d1[eln]*d1[eln]);

    rhsw[n4]=rhsw[n4]+(adv[4][1]+upw[4][1]+anuef*ak41)*wvel[n1]+
                      (adv[4][2]+upw[4][2]+anuef*ak42)*wvel[n2]+
                      (adv[4][3]+upw[4][3]+anuef*ak43)*wvel[n3]+
                      (adv[4][4]+upw[4][4]+anuef*ak44)*wvel[n4]+sw;

    dux=b1[eln]*uvel[n1]+b2[eln]*uvel[n2]+b3[eln]*uvel[n3]+b4[eln]*uvel[n4];
}

for(int i=1;i<=nodes;i++)
{
    uvel[i]=uvel[i]-delt*rhsu[i]/eml[i];
}

After porting to CUDA I am getting

GPU output
1.9148892009282

CPU output
1.9023344567543

How do I get matching results?

Differences at this level suggest you have a numerical stability problem. I would first compare against a quad-precision CPU implementation to estimate the error of both double-precision calculations. You may find the GPU calculation is more accurate simply due to changes in the order of operations.

Unfortunately, you then need to think carefully about how numerical error accumulates in your equations. Summations and differences with many terms can rapidly reduce the precision of an answer.
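A minimal sketch of that comparison idea, using long double as the higher-precision CPU reference (on most x86 compilers this is 80-bit extended precision; a true quad type such as __float128 can be substituted). The summed quantity here is just a stand-in for one of the RHS accumulations in the actual code:

#include <cstdio>

int main(void)
{
    const int   n     = 1000000;
    double      sum_d = 0.0;
    long double sum_q = 0.0L;     // higher-precision reference accumulator

    for (int i = 1; i <= n; ++i) {
        double term = 1.0 / ((double)i * (double)i);   // stand-in for one contribution
        sum_d += term;                                 // the computation being checked
        sum_q += (long double)term;                    // the reference
    }

    printf("double result            : %.17g\n", sum_d);
    printf("extended-precision ref   : %.17Lg\n", sum_q);
    printf("estimated rounding error : %.3Lg\n", (long double)sum_d - sum_q);
    return 0;
}

Running the same comparison against both the CPU and the GPU double-precision results shows which of the two is actually closer to the true answer.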

Some of my colleagues used software such as Mathematica, where the precision can be set arbitrarily high. This way they could check different orderings of the operations in order to minimize the rounding error.
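A quick way to see the ordering effect without extra tools (single precision is used here only so the difference shows up after a few digits):

#include <cstdio>

int main(void)
{
    const int n = 100000;
    float forward = 0.0f, backward = 0.0f;

    for (int i = 1; i <= n; ++i)
        forward += 1.0f / (float)i;      // largest terms first
    for (int i = n; i >= 1; --i)
        backward += 1.0f / (float)i;     // smallest terms first (usually more accurate)

    printf("forward : %.8f\n", forward);
    printf("backward: %.8f\n", backward);
    return 0;
}

Both loops add exactly the same terms; only the order differs, yet the rounded results do not agree to full precision.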