 # Rounding error problems

Hello,

I am writing a molecular dynamics code and I am having some trouble with precision. I do everything in double precision, but the program gives non-zero output for configurations that should yield exactly 0.

I am testing my code with 43 particles arranged in a hexagonal configuration. All of them are identical, and the total force on the particle in the center should be zero.

For each pair the force is computed using the following code:

``````
double dx,dy;
dx=xip-xjp;
dx=(dx<-llx*0.5?dx+llx:dx);
dx=(dx> llx*0.5?dx-llx:dx);

dy=yip-yjp;
dy=(dy<-lly*0.5?dy+lly:dy);
dy=(dy> lly*0.5?dy-lly:dy);

double rij2=dx*dx+dy*dy;

if(rij2<=9.0*hsq)
{
double rij=sqrt(rij2);
double q=rij/hg;
double dqkrij=quinticfirstderivative(q,cdqk);

double wp=cgd*(1.0/(rhoip)+1.0/rhojp)*dqkrij/rij;
xmyacc-=(wp*dx);
ymyacc-=(wp*dy);
}
``````

For the central particle the code is executed for all jp indices.

Because the configuration is symmetric the total force should be 0, but it is not; it is about 1.0e-10. This is large enough to cause problems in an energy-conserving algorithm like velocity Verlet.

The constant cgd is about 1.0e+6, rhoip and rhojp are about 1.0e+3, while the factor dqkrij/rij is about 15.
The function quinticfirstderivative(q,cdqk) is a fourth-order polynomial: a1*(3-q)^4+a2*(2-q)^4+a3*(1-q)^4.

When I port all of this to Fortran I get similar behavior, except the forces are a few orders of magnitude smaller.

The error in the calculation depends on the cgd constant: the error decreases if I decrease it.

Is this kind of error normal for CUDA?

Is there anything I can do to make it of the order 1.0e-14 without changing the constants?

Can you show buildable and runnable code, e.g. including quinticfirstderivative(), together with a set of sample inputs and outputs for a case that you think should return zero mathematically? When you evaluate the computation in quadruple precision (or double-double), what kind of result do you get?

I suspect that what is happening is that there is subtractive cancellation occurring somewhere in the code and therefore tiny 1 ulp differences elsewhere in the computation ultimately lead to a result close to zero but not zero. That would mean we have poorly conditioned computation. If so, I would expect the closeness to zero in the outputs to vary depending on input data, and sometimes be smaller for your CUDA code and sometimes smaller for your Fortran code.

One aspect that can contribute to such 1 ulp differences in intermediate computation is the use of FMA (fused multiply-add) on the GPU, so for a quick experiment try compiling with -fmad=false. Note that the use of FMA typically results in more accurate results, in addition to higher performance.

Hello,

Thank you for your reply. Here is the code for the kernel:

``````
__device__ double quinticfirstderivative(double q,double cnst)
{
double a1=((3.0-q)*(3.0-q))*((3.0-q)*(3.0-q));
double a2=((2.0-q)*(2.0-q))*((2.0-q)*(2.0-q));
double a3=((1.0-q)*(1.0-q))*((1.0-q)*(1.0-q));
double www=0;
www=(q<3?-5.0*a1:www);
www=(q<2?www+30*a2:www);
www=(q<1?www-75*a3:www);
return www*cnst;
}
``````

I tried the flag, but it did not seem to help.

I wanted to avoid posting the code because it is huge and has many other things, but here it is:

The Fortran folder is just to test the same configuration (some parameters might be different, but it is not relevant).

The relevant file is cudaroutines.cu. The function that computes the forces is gpu_gasgasforces(…). For the test this function is called only once, for the initial configuration, at around line 1650. The data is saved in a file called newacc. For compiling there is a bash script with the commands.

OK, now I see subtractions which could cause subtractive cancellation. It’s a bit curious that -fmad=false doesn’t make any difference. The basic arithmetic on both the CPU and the GPU is IEEE-compliant, so results should match. And with FMA merging out of the way, the sequence of floating-point operations should match as well.

However, your Fortran compiler could transform the computation mathematically (something that Fortran allows but C/C++ frowns upon, because floating-point arithmetic is not mathematics). Try setting the Fortran compiler to the strictest possible IEEE compliance; for the Intel compiler this would be /fp:strict (Windows) or -fp-model strict (Linux). Does the Fortran code run on a platform that uses the x87 FPU for floating-point arithmetic? That could provide extra precision (more than double precision) for intermediate computation.

In order to locate the exact source of the numeric discrepancies, you will have to trace at least one “failing” test vector through all intermediate computations step by step.

Hello,

My CPUs are just normal 64-bit Intel laptop/desktop processors. I was under the impression that on the CPU all operations are done internally in 80-bit precision and then the results are rounded back.

If I play with the Fortran code and use 128-bit precision, all the forces become 1.0e-30; if I keep 64-bit precision it is 1.0e-12. For the same numbers in CUDA I get 1.0e-7.

I realize now that the only source of error comes from the factor dqkrij/rij (or maybe another bug in the code). However, I thought the error in computing dqkrij should not be too large since it is only a 4th-order polynomial. I tried to rewrite the code to change the order of some operations, but without any result.

Could the compiler do some nasty optimization in the function? Is there a better way to compute it?

Also, each thread computes the force for one particle (for each particle ip: f(ip)=f(ip)+fij(ip,jp)).
Would the results get better if I saved each pair interaction (ip,jp) and then did a reduction to get the total force? This would introduce an additional write+read to global memory, which would impact performance, and I would have to spend another few days implementing it.

Could a small difference in computing the inter-particle distance create such big differences in the 4th-order polynomial?

Maybe I still have a bug in the code, but both cuda-memcheck and cuda-gdb report no errors.

If you are on a 64-bit OS, mathematical computations on the CPU are by default done through the SSE/AVX units, and there is just IEEE-754 double precision, no extra precision. 80-bit extended precision was used as the default mode with the x87 FPU used on 32-bit x86 platforms.

Other than merging FMULs and FADDs into FMAs, the CUDA compiler performs no re-associations in floating-point computations that are not value preserving. By using -fmad=false you should be guaranteed that the computation is performed exactly as written.

On the other hand, compilers for the CPU, such as the Intel compilers, will often default to a mode that allows the re-ordering of floating-point computations for performance reasons. That is why I suggested forcing the host compiler into strict IEEE-754 compliance.

Unless I overlooked something, your code only comprises basic arithmetic operations (+, -, *, /, sqrt) that are governed by IEEE-754 and therefore the same sequence of operations should produce the same result on both CPU and GPU. This may suggest that the numerical differences arise in code outside of the code shown above.

As I said, you will need to do a step-by-step analysis of the computation to find out where differences creep in. Given the small amount of code involved, you should be able to do that in less than an hour.

I may be looking at the wrong code, but I noticed the following. Your Fortran code has:

``````
wp=((prhoip/rhogas(ip)**2+prhojp/rhogas(jp)**2)*qfd(q,cstd))/rij
``````

``````
double wp=cgd*(1.0/(rhoip)+1.0/rhojp)*dqkrij/rij;
``````

Not the same computation. I would suggest using exactly the same sequence of basic arithmetic operations on both platforms if you desire matching results. Mathematically identical expressions are rarely identical in finite-precision floating-point math.

Of course, once you go beyond basic arithmetic governed by IEEE-754 (e.g. transcendental functions), all bets are off.

Hello,

I tried different formulas and orders of execution, unfortunately with almost no visible change.

Hello,

Thank you for trying to help me. Your explanations helped me understand more about the rounding. It looks like I was comparing apples with pears. The Fortran code was missing an extra multiplication by a constant of about 500. Once I put that in and ran it again for exactly the same parameters and configuration, I got a similar size for the error: what is supposed to be 0 becomes 1.0e-7. This did not solve my problem, but at least I now think there is little chance that there is a bug in my code.

Doing the calculations with 128-bit precision decreases the error to 1.0e-26, which is acceptable if I do the rest in 64-bit precision. However, I think this is not possible in CUDA.

Now my only hope is that for the real runs the forces in the system are large enough so that this error does not affect the results of our simulations.

I will remember this: CUDA and CPU codes will give the same result for normal operations, up to machine precision.

I think it would be worthwhile to track down the source of the discrepancies in the interest of sound science.

If you can extract a minimal example into two standalone programs that demonstrate the issue and that I can build and run, I’d be happy to take a look. I have the 2013 Intel Fortran compiler, and I have CUDA 8.0.

Hello,

In both CUDA and Fortran I tried to see what error remains if I take away the large prefactor. The error is reasonably small, between 1.0e-13 and 1.0e-16. But when I multiply by the prefactor it becomes 1.0e-7 to 1.0e-9. This is only a test to see what happens for a configuration whose total force should be 0. In practice I suppose the force is quite large and other terms are present, but I expect that even if my integration algorithm contains no noise, the trajectories of the particles will be different on different computers or different runs.

Both the CUDA and Fortran codes from the Dropbox link are fixed for the test. They have the same parameters and the same configuration for the particles. It would take a few hours to clean up the CUDA code, but the rest of the code does not affect the test; it is commented out.

I supposed that if the Fortran code could be changed to reduce the error without using 128-bit precision, it would be applicable to the CUDA code as well.

The CUDA code has the compilation command in a script COMPILE.GLdemo (nothing special, it only includes several files and paths). The program initializes the variables, makes all the necessary allocations, and creates the first configuration. This is done with the function gpusphinit(…) from the file cudaroutines.cu. The function also calculates the force once and then saves it into the file ‘newacc’. The forces are the second- and third-to-last columns.

The Fortran code is very simple. No special flags for compiling; I used gfortran. The forces are saved in a text file ‘forces.txt’.

I guess I expressed myself poorly. When I mentioned “minimal” and “standalone” I envisioned two single-file programs (one CUDA, one Fortran), each about 100 lines in size, posted to this thread. If need be, a small data file of similar size for the input data. In other words, what Stackoverflow users refer to as an MCVE (minimal, complete, verifiable example): https://stackoverflow.com/help/mcve

I understand.

The Fortran code is as minimalistic as it can be, and it already has 116 lines. The CUDA code will have quite a few lines, even after I remove all the unrelated functions and compact it into a single file. It will take some time to do. I will have it in a day or two.

It will be helpful if you can post the code here, as opposed to linking it offsite. Links tend to rot, and future readers of this thread are helped by having the code captured in the thread itself.

The entire computation seems to be made up from basic arithmetic, so with both CPU and GPU operating with IEEE-754 compliant double-precision operations, as long as the sequence of operations matches between platforms (i.e. we disable generation of FMAs in CUDA and suppress extra precision and non value preserving transformations in Fortran), we should get the same results for the same inputs. Everything else would be indicative of a bug somewhere.

One caveat here might be floating-point literal constants. In Fortran, they default to single precision, while in CUDA (as in C++) they default to double precision.

Thank you for your comments. I will make a small CUDA code and post it. It will be more than 100 lines, but not very long.
I already have the functions; I just have to put them in a different file and remove the extra stuff.

Now I am quite certain the two codes give similar results and errors. It comes from the last digit, which is multiplied by a large number. Maybe it can be improved.

For debugging these kinds of numerical issues it is usually a good idea to make the CUDA code single-threaded, i.e. use a runtime configuration of <<<1,1>>>.

I note that the Fortran code contains this literal constant, which feeds directly into the computation of ‘wp’:

``````
double precision, parameter :: cstd=560.6843136844119044
``````

This literal constant is treated as single precision, prior to assignment to the double-precision parameter. If I change the code to

``````
double precision, parameter :: cstd=560.6843136844119044d+0
``````

to make it into a double-precision literal I observe significant differences in the output file forces.txt. There are additional floating-point literals used in the Fortran code that appear to be intended as double-precision literals, but are not in fact double-precision literals.

For your reference, here is the output of `head -100 forces.txt` after I compiled initforces.f as currently posted on Dropbox with ifort 13.1.3.198, switches /O1 /fp:strict, and ran it on 64-bit Windows 7 Professional:

``````
1  0.000000000000000E+000  3.608438893628681E-003
-1.552689354866743E-008  1.391163095831871E-007
2  0.000000000000000E+000  1.082531668088604E-002
-8.840288501232862E-009 -6.082700565457344E-009
3  0.000000000000000E+000  1.804219446814340E-002
-5.697074811905622E-009  9.604264050722122E-010
4  0.000000000000000E+000  2.525907225540076E-002
-1.175067154690623E-008  1.865555532276630E-008
5  0.000000000000000E+000  3.247595004265813E-002
-7.661583367735147E-009 -2.776505425572395E-008
6  0.000000000000000E+000  3.969282782991549E-002
-7.661583367735147E-009 -1.688022166490555E-009
7  0.000000000000000E+000  4.690970561717285E-002
-8.025381248444319E-009  1.891748979687691E-009
8  0.000000000000000E+000  5.412658340443021E-002
-1.744047040119767E-008  1.880107447504997E-008
9  0.000000000000000E+000  6.134346119168758E-002
-9.989889804273844E-009 -5.384208634495735E-008
10  0.000000000000000E+000  6.856033897894492E-002
8.731149137020111E-011  8.175265975296497E-008
11  0.000000000000000E+000  7.577721676620229E-002
-1.971784513443708E-009 -9.022187441587448E-010
12  0.000000000000000E+000  8.299409455345966E-002
-5.697074811905622E-009 -4.560570232570171E-008
13  0.000000000000000E+000  9.021097234071701E-002
-6.628397386521101E-009  5.311449058353901E-008
14  0.000000000000000E+000  9.742785012797438E-002
-5.697074811905622E-009 -4.560570232570171E-008
15  0.000000000000000E+000  0.104644727915232
-3.637978807091713E-009  6.219488568603992E-008
16  0.000000000000000E+000  0.111861605702489
-2.706656232476234E-009  8.178176358342171E-009
17  0.000000000000000E+000  0.119078483489746
-1.251464709639549E-009 -8.963979780673981E-009
18  0.000000000000000E+000  0.126295361277004
-2.437445800751448E-009 -3.908644430339336E-008
19  0.000000000000000E+000  0.133512239064261
-9.422365110367537E-009  9.874929673969746E-008
20  0.000000000000000E+000  0.140729116851519
-9.822542779147625E-010 -1.148728188127279E-007
21  0.000000000000000E+000  0.147945994638776
-2.437445800751448E-009  8.411007001996040E-009
22  0.000000000000000E+000  0.155162872426033
-1.035368768498302E-008  9.036739356815815E-008
23  0.000000000000000E+000  0.162379750213291
-9.822542779147625E-010 -1.148728188127279E-007
24  0.000000000000000E+000  0.169596628000548
-9.822542779147625E-010 -7.770722731947899E-009
25  0.000000000000000E+000  0.176813505787805
-2.437445800751448E-009  8.411007001996040E-009
26  0.000000000000000E+000  0.184030383575063
-1.035368768498302E-008  9.036739356815815E-008
27  0.000000000000000E+000  0.191247261362320
-9.822542779147625E-010 -1.148728188127279E-007
28  0.000000000000000E+000  0.198464139149577
-2.437445800751448E-009  8.411007001996040E-009
29  0.000000000000000E+000  0.205681016936835
-1.035368768498302E-008  9.036739356815815E-008
30  0.000000000000000E+000  0.212897894724092
-2.437445800751448E-009 -9.869108907878399E-008
31  0.000000000000000E+000  0.220114772511350
-1.035368768498302E-008  9.036739356815815E-008
32  0.000000000000000E+000  0.227331650298607
-3.245077095925808E-009 -1.320731826126575E-007
33  0.000000000000000E+000  0.234548528085864
-1.640728441998363E-008 -7.261405698955059E-008
34  0.000000000000000E+000  0.241765405873122
-1.228909241035581E-008  2.984888851642609E-007
35  0.000000000000000E+000  0.248982283660379
-6.984919309616089E-009 -3.296881914138794E-007
36  0.000000000000000E+000  0.256199161447636
-3.725290298461914E-009  0.000000000000000E+000
37  2.083333203982981E-003  0.000000000000000E+000
-7.858034223318100E-008  1.392327249050140E-007
38  2.083333203982981E-003  7.216877787257362E-003
-7.776543498039246E-008  1.327134668827057E-008
39  2.083333203982981E-003  1.443375557451472E-002
-8.160714060068130E-008 -1.071020960807800E-008
40  2.083333203982981E-003  2.165063336177208E-002
-8.277129381895065E-008  1.816079020500183E-008
41  2.083333203982981E-003  2.886751114902945E-002
-8.626375347375870E-008 -1.396983861923218E-008
42  2.083333203982981E-003  3.608438893628681E-002
-8.533243089914322E-008 -7.450580596923828E-009
43  2.083333203982981E-003  4.330126672354417E-002
-8.533243089914322E-008 -3.725290298461914E-009
44  2.083333203982981E-003  5.051814451080153E-002
-8.568167686462402E-008  2.351589500904083E-008
45  2.083333203982981E-003  5.773502229805889E-002
-8.800998330116272E-008 -5.378387868404388E-008
46  2.083333203982981E-003  6.495190008531625E-002
-7.823109626770020E-008  3.166496753692627E-008
47  2.083333203982981E-003  7.216877787257361E-002
-7.823109626770020E-008  4.423782229423523E-009
48  2.083333203982981E-003  7.938565565983098E-002
-7.869675755500793E-008 -6.193295121192932E-008
49  2.083333203982981E-003  8.660253344708833E-002
-7.823109626770020E-008  4.586763679981232E-008
50  2.083333203982981E-003  9.381941123434570E-002
-7.869675755500793E-008 -5.331821739673615E-008
``````

Hello,

Yes, this output is very similar to what I get. The first 2 columns are the x and y coordinates of the particles, and the 3rd and 4th are the total forces. The same thing comes out of CUDA as well, though not necessarily exactly the same numbers. Changing the representation of the constants does not change much.

The problem is that these forces should be 0.

Fortran code:

``````
program semiimplicitampleq
implicit none
integer, parameter :: Ngas= 4320, sqgridx=60, sqgridy=36
double precision, parameter :: hgas=0.005, insp=hgas/1.2
double precision :: LLX
double precision :: LLY
double precision, parameter :: rhonot=1000.0
double precision, parameter :: cstd=560.6843136844119044
double precision, parameter :: cg=343*343 ! 25*25
double precision :: rhogas(Ngas),fx(Ngas),fy(Ngas)
double precision :: xpos(Ngas), ypos(Ngas)
double precision :: deltax, deltay, wp, rij, rij2, q
integer :: ip, jp
double precision :: qfd
double precision :: prhoip, prhojp

LLX=sqgridx*insp
LLY=sqgridy*insp*sqrt(3.0)

rhogas=rhonot
call posinit()

fx=0.0
fy=0.0
open(unit=7, file='forces.txt')
do ip=1, Ngas
do jp=1, Ngas

if(ip.ne.jp) then

deltax=xpos(ip)-xpos(jp)
if(deltax .lt. - LLX/2.0) then
deltax= deltax+LLX
endif
if(deltax .gt.  LLX/2) then
deltax= deltax-LLX
endif

deltay=ypos(ip)-ypos(jp)
if(deltay .lt. - LLY/2.0) then
deltay= deltay+LLY
endif
if(deltay .gt.  LLY/2) then
deltay= deltay-LLY
endif

rij2=deltax*deltax+deltay*deltay
if(rij2< 9.0*hgas*hgas) then

rij=sqrt(rij2)

q=rij/hgas
prhoip=cg*rhogas(ip)
prhojp=cg*rhogas(jp)
wp=((prhoip/rhogas(ip)**2+prhojp/rhogas(jp)**2)*qfd(q,cstd))/rij

fx(ip)=fx(ip)-wp*deltax
!      fx(ip)=fx(ip)-1

fy(ip)=fy(ip)-wp*deltay

endif

endif

enddo
write(7,*) ip, xpos(ip),ypos(ip), fx(ip), fy(ip)
enddo
close(7)

contains

subroutine posinit()
integer :: ii,jj,ip

open(unit=5, file='position.txt')
do ip=1,Ngas

ii=(ip-1)/sqgridy
jj=mod(ip-1, sqgridy)
xpos(ip)=ii*insp/2;
ypos(ip)=0.5*insp*sqrt(3.0)*(1-mod(ii,2))+jj*insp*sqrt(3.0);

write(5,*) ip, xpos(ip), ypos(ip)
enddo
close(5)

return
end subroutine

end program

double precision function qfd(q, cnst)

double precision :: q, cnst
double precision :: a1, a2, a3
a1=((3.0-q)*(3.0-q))*((3.0-q)*(3.0-q))
a2=((2.0-q)*(2.0-q))*((2.0-q)*(2.0-q))
a3=((1.0-q)*(1.0-q))*((1.0-q)*(1.0-q))
qfd=0
if(q<=3.0) then
qfd=-5.0*a1
endif
if(q<=2.0) then
qfd=qfd+30.0*a2
endif
if(q<=1.0) then
qfd=qfd-75.0*a3
endif
qfd=qfd*cnst

return
end
``````

The point is not that these computations do not sum to zero. They can’t, with finite precision floating-point arithmetic. Instead, the results basically represent the accumulated rounding error.

The point is that a sequence of IEEE-754 compliant floating-point operations should deliver the same results on any platform given the same inputs. I am quite confident that Intel Fortran with /fp:strict delivers appropriate reference results. The results are invariant to optimization level, and inspection of the generated assembly code shows SSE double-precision code that faithfully implements the computation as written.

Changing the type of the literal constant for just this parameter, like so:

``````
!      double precision, parameter :: cstd=560.6843136844119044
double precision, parameter :: cstd=560.6843136844119044d0
``````

changes many of the outputs in forces.txt, including cases like this that involve a sign change:

``````
>   3.353150646034919E-009  9.385985322296619E-009
<  -3.202870857421658E-009 -9.618815965950489E-009
``````

Similarly, making this change:

``````
!      double precision, parameter :: hgas=0.005, insp=hgas/1.2
double precision, parameter :: hgas=0.005d0, insp=hgas/1.2d0
``````

leads to differences like this in forces.txt:

``````
>          37  2.083333203982981E-003  0.000000000000000E+000
-7.858034223318100E-008  1.392327249050140E-007
<          37  2.083333333333333E-003  0.000000000000000E+000
8.847564458847046E-009  0.000000000000000E+000
``````

I would assume that the double-precision literals are what is intended here.

I understand that. I thought that I could get rid of the numerical error. So I suppose I can never make it disappear, or get it to a lower level, without using higher precision or decreasing the two large constants (cg and cstd).