Rounding error problems


I am trying to do a molecular dynamics code and I am having some trouble with the precision. I am writing an Molecular Dynamics code. I am doing everything in double precision, but it seems the program gives non-zero output for configurations which should have 0.

I am testing my code with a 43 particles arranged in a hexagonal configuration. All of them are identical and the total force on the particle in the center should be zero.

For each pair the force is computed using the following code:

double dx,dy;
dx=(dx> llx*0.5?dx-llx:dx);

dy=(dy> lly*0.5?dy-lly:dy);

double rij2=dx*dx+dy*dy;

   double rij=sqrt(rij2);
   double q=rij/hg;
   double dqkrij=quinticfirstderivative(q,cdqk);
   double wp=cgd*(1.0/(rhoip)+1.0/rhojp)*dqkrij/rij;

For the central particle the code is executed for all jp indeces.

Because the configuration is symmetrical the total force should be 0, but it is not. It 1.0e-10. This is large enough to make problems in a energy conserverving algorithm algorithm like the velocity Verlet.

The constant cgd is about 1.0e+6, the rhoip and rhojp 1.0e+3, while the factor dqkrij/rij is about 15.
The function quinticfirstderivative(q,cdqk) is a forth order polynomial a1*(3-q)^4+a2*(2-q)^4+a3*(1-q)^4.

When I put all this Fortran I get a similar behavior, however the forces are a few order of magnitude smaller.

When error the calcualtion depends on the cgd constant. The error decreases if I decrease it.

Is this kind of error normal for CUDA?

Is there anything I can do to make it of the order 1.0e-14 without changing the constants?

Can you show buildable and runnable code, e.g. including quinticfirstderivative(), together with a set of sample inputs and outputs for a case that you think should return zero mathematically? When you evaluate the computation in quadruple precision (or double-double), what kind of result do you get?

I suspect that what is happening is that there is subtractive cancellation occurring somewhere in the code and therefore tiny 1 ulp differences elsewhere in the computation ultimately lead to a result close to zero but not zero. That would mean we have poorly conditioned computation. If so, I would expect the closeness to zero in the outputs to vary depending on input data, and sometimes be smaller for your CUDA code and sometimes smaller for your Fortran code.

One aspect that can contribute to such 1 ulp differences in intermediate computation is the use of FMA (fused multiply-add) on the GPU, so for a quick experiment try compiling with -fmad=false. Note that the use of FMA typically results in more accurate results, in addition to higher performance.


Thank you for your reply. Here is the code for the kernel:

__device__ double quinticfirstderivative(double q,double cnst)
 double a1=((3.0-q)*(3.0-q))*((3.0-q)*(3.0-q));
 double a2=((2.0-q)*(2.0-q))*((2.0-q)*(2.0-q));
 double a3=((1.0-q)*(1.0-q))*((1.0-q)*(1.0-q));
 double www=0;
 return www*cnst;

I tried the flag, but it did not seem to help.

I wanted to avoid posting the code, because it is huge and it has many other things, but here it is:

The Fortran folder is just to test the same configuration (some parameters might be different, but it is not relevant).

The relevant file is The function computes the forces is gpu_gasgasforces(…). For the test this function is only called once for the initial configuration in the same line at around line 1650. The data is save in a file called newacc. For compiling there is a bash script with commands.

OK, now I see subtractions which could cause subtractive cancellation. It’s a bit curious that -fmad=false doesn’t make any difference. The basic arithmetic in both CPU and GPU is IEEE compliant, so should match. And with FMA-merging out of the way, the sequence of floating-point operations should match.

However, your Fortran compiler could transform the computation mathematically (something that Fortran allows but C/C++ frowns upon because floating-point arithmetic is not mathematics). Try setting the Fortran compiler for stricted possible IEEE compliance, e.g. for the Intel compiler this would be /fp:strict (Windows) or -fp-model strict (Linux). Does the Fortran code run on a platform that uses the x87 FPU for floating-point arithmetic? That could provide extra precision (more than double precision) for intermediate computation.

In order to locate the exact source of the numeric discrepancies, you will have to trace at least one “failing” test vector through all intermediate computations step by step.


My cpus are just normal laptop/desktop 64 bit intel processors. I was under the impresssion that on cp all operations are done internally in 80 bit precision and then the results are rounded back.

If I play with fortran and use 128 bit precision all the forces become 1.0e-30, if I keep 64 bit precision it is 1.0e-12. For the same numbers in CUDA I get 1.0e-7.

I realize now that only source of errors come from the factor dqkrij/rij (or maybe another bug in the code), however I thought the error in computing dqkrij should not be too large since it is only 4th order polynomial. I tried to rewrite the code to change order of some operation, but without any result.

Could the compiler do some nasty optimization in the function? Is there a better way to compute it?

Also the each thread computes the force for one particle (for each particle ip f(ip)=f(ip)+fij(ip,jp)).
Would the results get better if I would save each pair interaction (ip,jp) and then do a reduction to get the total force? This would introduce an additional write+read operation to the global memory which would impact the performance and I would have to spend another few days to implement it.

Could a small difference in computing the inter particle distance create so big differences in the 4th order polynomial?

Maybe I still have a bug in the code, but both cuda-memcheck and cuda-gdb report no errors.

If you are on a 64-bit OS, mathematical computations on the CPU are by default done through the SSE/AVX units, and there is just IEEE-754 double precision, no extra precision. 80-bit extended precision was used as the default mode with the x87 FPU used on 32-bit x86 platforms.

Other than merging FMULs and FADDs into FMAs, the CUDA compiler performs no re-associations in floating-point computations that are not value preserving. By using -fmad=false you should be guaranteed that the computation is performed exactly as written.

On the other hand, compilers for the CPU, such as the Intel compilers, will often default to a mode that allows the re-ordering of floating-point computations for performance reasons. That is why I suggested forcing the host compiler into strict IEEE-754 compliance.

Unless I overlooked something, your code only comprises basic arithmetic operations (+, -, *, /, sqrt) that are governed by IEEE-754 and therefore the same sequence of operations should produce the same result on both CPU and GPU. This may suggest that the numerical differences arise in code outside of the code shown above.

As I said, you will need to do a step-by-step analysis of the computation to find out where differences creep in. Given the small amount of code involved, you should be able to do that in less than an hour.

I may be looking at the wrong code, but I noticed the following. Your Fortran code has:


while your CUDA code has:

double wp=cgd*(1.0/(rhoip)+1.0/rhojp)*dqkrij/rij;

Not the same computation. I would suggest using exactly the same sequence of basic arithmetic operations on both platforms if you desire matching results. Mathematically identical expressions are rarely identical in finite-precision floating-point math.

Of course, once you go beyond basic arithmetic governed by IEEE-754 (e.g. transcendental functions), all bets are off.


I tried different formulas and order of execution, unfortunately without almost any visible change.


Thank you for trying to help me. Your explanations helped me understand more about the rounding. It looks like I was comparing apple with pears. The Fortran code seemed to miss an extra multiplication with a constant which about 500. Once I put that in and run it again for the exactly the same parameters and configuration I got the similar size for the error. What is supposed to be 0 becomes 1.0e-7. This did no solve my problem, but at least I think now there is little chance that there is a bug in my code.

Doing the calculations with 128 bit precision decreases the error to 1.0e-26, which is acceptable if I do the rest in 64 bit precision. However I think this is not possible in CUDA.

Now my only hope is that for the real runs the forces in the system are large enough so that this error does not affect the results of our simulations.

I will remember this. CUDA and CPU codes will give the same result up for normal operations up to machine precision.

I think it would be worthwhile to track down the source of the discrepancies in the interest of sound science.

If you can extract a minimal example into two standalone programs that demonstrate the issue and that I can build and run, I’d be happy to take a look. I have the 2013 Intel Fortran compiler, and I have CUDA 8.0.


In both CUDA and Fortran I tried to see which one gives the error if take a way the large prefactor. The error is reasonable small between 1.0e-13 and 1.0e-16. But when multiply by the prefactor it becomes 1.0e-7 to 1.0e-9. This is a test only to see what happens if I put a configuration which would give the total force equal to 0. In practice I suppose the force is quite large and other terms are present, but I except that even if my integration algorithm contains no noise, the trajectories of the particles will be different on differnt computers or different runs.

Both CUDA and Fortran codes from the dropbox link are fixed for the test. They have the same parameters and the same configuration for the particles. It would take a few hours to clean the cuda code, but the rest of the code does not affect the test, it is commented.

I supposed that if the Fortran code could be changed to reduce the error, without using lon-long it would be applicable to the CUDA code as well.

The CUDA code has the compilation command in a script COMPILE.GLdemo (nothing special, only that it includes several files and paths). The program initializes the variables, makes all the necessary allocations, and creates the first configuration. This is done with the function gpusphinit(…) from the file . The fucntion also calculations once the force and then it saves it into the file ‘newacc’. The forces are the second and third last columns.

The fortrant code is very simple. No special flags for compiling, I used gfortran. The forces are saved in a text file ‘forces.txt’.

I guess I expressed myself poorly. When I mentioned “minimal” and “standalone” I envisioned two single-file programs (one CUDA, one Fortran), each about 100 lines in size, posted to this thread. If need be, a small data file of similar size for the input data. In other words, what Stackoverflow users refer to as an MCVE (minimal, complete, verifiable example):

I understand.

The Fortran code is as minimalistic as it can be and it already has 116 lines. The cuda code will have quite many lines, even after I remove all the unrelated functions and compact it to a single file. It would take some time to do it. I will have it in a day or two.

It will be helpful if you can post the code here, as opposed to linking it offsite. Links tend to rot, and future readers of this thread are helped by having the code captured in the thread itself.

The entire computation seems to be made up from basic arithmetic, so with both CPU and GPU operating with IEEE-754 compliant double-precision operations, as long as the sequence of operations matches between platforms (i.e. we disable generation of FMAs in CUDA and suppress extra precision and non value preserving transformations in Fortran), we should get the same results for the same inputs. Everything else would be indicative of a bug somewhere.

One caveat here might be floating-point literal constants. In Fortran, they default to single precision, while in CUDA (as in C++) they default to double precision.

Thank you for your comments. I will make a small code for the cuda and posted it. It will more than than 100 lines but not very long.
I already have the functions, I just have to put them in a different file and remove the extra stuff.

Now I am quite certain the two codes give similar results and error. It is because of the last digit which is multiplied by by a large number. Maybe it can be improved.

For debugging these kind of numerical issues it is usually a good idea to make the CUDA code single threaded, i.e. use a runtime configuration of <<<1,1>>>

I note that the Fortran code contains this literal constant, which feeds directly into the computation of ‘wp’:

double precision, parameter :: cstd=560.6843136844119044

This literal constant is treated as single precision, prior to assignment to the double-precision parameter. If I change the code to

double precision, parameter :: cstd=560.6843136844119044d+0

to make it into a double-precision literal I observe significant differences in the output file forces.txt. There are additional floating-point literals used in the Fortran code that appear to be intended as double-precision literals, but are not in fact double-precision literals.

For your reference, here is the output of head -100 forces.txt after I compiled initforces.f as currently posted on Dropbox with ifort, switches /O1 /fp:strict, and ran it on 64-bit Windows 7 Professional:

1  0.000000000000000E+000  3.608438893628681E-003
 -1.552689354866743E-008  1.391163095831871E-007
           2  0.000000000000000E+000  1.082531668088604E-002
 -8.840288501232862E-009 -6.082700565457344E-009
           3  0.000000000000000E+000  1.804219446814340E-002
 -5.697074811905622E-009  9.604264050722122E-010
           4  0.000000000000000E+000  2.525907225540076E-002
 -1.175067154690623E-008  1.865555532276630E-008
           5  0.000000000000000E+000  3.247595004265813E-002
 -7.661583367735147E-009 -2.776505425572395E-008
           6  0.000000000000000E+000  3.969282782991549E-002
 -7.661583367735147E-009 -1.688022166490555E-009
           7  0.000000000000000E+000  4.690970561717285E-002
 -8.025381248444319E-009  1.891748979687691E-009
           8  0.000000000000000E+000  5.412658340443021E-002
 -1.744047040119767E-008  1.880107447504997E-008
           9  0.000000000000000E+000  6.134346119168758E-002
 -9.989889804273844E-009 -5.384208634495735E-008
          10  0.000000000000000E+000  6.856033897894492E-002
  8.731149137020111E-011  8.175265975296497E-008
          11  0.000000000000000E+000  7.577721676620229E-002
 -1.971784513443708E-009 -9.022187441587448E-010
          12  0.000000000000000E+000  8.299409455345966E-002
 -5.697074811905622E-009 -4.560570232570171E-008
          13  0.000000000000000E+000  9.021097234071701E-002
 -6.628397386521101E-009  5.311449058353901E-008
          14  0.000000000000000E+000  9.742785012797438E-002
 -5.697074811905622E-009 -4.560570232570171E-008
          15  0.000000000000000E+000  0.104644727915232     
 -3.637978807091713E-009  6.219488568603992E-008
          16  0.000000000000000E+000  0.111861605702489     
 -2.706656232476234E-009  8.178176358342171E-009
          17  0.000000000000000E+000  0.119078483489746     
 -1.251464709639549E-009 -8.963979780673981E-009
          18  0.000000000000000E+000  0.126295361277004     
 -2.437445800751448E-009 -3.908644430339336E-008
          19  0.000000000000000E+000  0.133512239064261     
 -9.422365110367537E-009  9.874929673969746E-008
          20  0.000000000000000E+000  0.140729116851519     
 -9.822542779147625E-010 -1.148728188127279E-007
          21  0.000000000000000E+000  0.147945994638776     
 -2.437445800751448E-009  8.411007001996040E-009
          22  0.000000000000000E+000  0.155162872426033     
 -1.035368768498302E-008  9.036739356815815E-008
          23  0.000000000000000E+000  0.162379750213291     
 -9.822542779147625E-010 -1.148728188127279E-007
          24  0.000000000000000E+000  0.169596628000548     
 -9.822542779147625E-010 -7.770722731947899E-009
          25  0.000000000000000E+000  0.176813505787805     
 -2.437445800751448E-009  8.411007001996040E-009
          26  0.000000000000000E+000  0.184030383575063     
 -1.035368768498302E-008  9.036739356815815E-008
          27  0.000000000000000E+000  0.191247261362320     
 -9.822542779147625E-010 -1.148728188127279E-007
          28  0.000000000000000E+000  0.198464139149577     
 -2.437445800751448E-009  8.411007001996040E-009
          29  0.000000000000000E+000  0.205681016936835     
 -1.035368768498302E-008  9.036739356815815E-008
          30  0.000000000000000E+000  0.212897894724092     
 -2.437445800751448E-009 -9.869108907878399E-008
          31  0.000000000000000E+000  0.220114772511350     
 -1.035368768498302E-008  9.036739356815815E-008
          32  0.000000000000000E+000  0.227331650298607     
 -3.245077095925808E-009 -1.320731826126575E-007
          33  0.000000000000000E+000  0.234548528085864     
 -1.640728441998363E-008 -7.261405698955059E-008
          34  0.000000000000000E+000  0.241765405873122     
 -1.228909241035581E-008  2.984888851642609E-007
          35  0.000000000000000E+000  0.248982283660379     
 -6.984919309616089E-009 -3.296881914138794E-007
          36  0.000000000000000E+000  0.256199161447636     
 -3.725290298461914E-009  0.000000000000000E+000
          37  2.083333203982981E-003  0.000000000000000E+000
 -7.858034223318100E-008  1.392327249050140E-007
          38  2.083333203982981E-003  7.216877787257362E-003
 -7.776543498039246E-008  1.327134668827057E-008
          39  2.083333203982981E-003  1.443375557451472E-002
 -8.160714060068130E-008 -1.071020960807800E-008
          40  2.083333203982981E-003  2.165063336177208E-002
 -8.277129381895065E-008  1.816079020500183E-008
          41  2.083333203982981E-003  2.886751114902945E-002
 -8.626375347375870E-008 -1.396983861923218E-008
          42  2.083333203982981E-003  3.608438893628681E-002
 -8.533243089914322E-008 -7.450580596923828E-009
          43  2.083333203982981E-003  4.330126672354417E-002
 -8.533243089914322E-008 -3.725290298461914E-009
          44  2.083333203982981E-003  5.051814451080153E-002
 -8.568167686462402E-008  2.351589500904083E-008
          45  2.083333203982981E-003  5.773502229805889E-002
 -8.800998330116272E-008 -5.378387868404388E-008
          46  2.083333203982981E-003  6.495190008531625E-002
 -7.823109626770020E-008  3.166496753692627E-008
          47  2.083333203982981E-003  7.216877787257361E-002
 -7.823109626770020E-008  4.423782229423523E-009
          48  2.083333203982981E-003  7.938565565983098E-002
 -7.869675755500793E-008 -6.193295121192932E-008
          49  2.083333203982981E-003  8.660253344708833E-002
 -7.823109626770020E-008  4.586763679981232E-008
          50  2.083333203982981E-003  9.381941123434570E-002
 -7.869675755500793E-008 -5.331821739673615E-008


Yes this output is very similar to what I get. The first 2 columns are x and y coordinates of the particles and the 3rd and 4th the total forces. The same things comes out also with CUDA, though necessarily exactly the same numbers. The constants representation do not change much.

The problem is that these forces should be 0.

Fortran COde:

program semiimplicitampleq
      implicit none
      integer, parameter :: Ngas= 4320, sqgridx=60, sqgridy=36
      double precision, parameter :: hgas=0.005, insp=hgas/1.2
      double precision :: LLX
      double precision :: LLY
      double precision, parameter :: rhonot=1000.0
      double precision, parameter :: cstd=560.6843136844119044 
      double precision, parameter :: cg=343*343 ! 25*25
      double precision :: rhogas(Ngas),fx(Ngas),fy(Ngas)
      double precision :: xpos(Ngas), ypos(Ngas)
      double precision :: deltax, deltay, wp, rij, rij2, q
      integer :: ip, jp
      double precision :: qfd
      double precision :: prhoip, prhojp
      call posinit()
      open(unit=7, file='forces.txt')
      do ip=1, Ngas
      do jp=1, Ngas
      if( then
      if(deltax .lt. - LLX/2.0) then
      deltax= deltax+LLX
      if(deltax .gt.  LLX/2) then
      deltax= deltax-LLX
      if(deltay .lt. - LLY/2.0) then
      deltay= deltay+LLY
      if(deltay .gt.  LLY/2) then
      deltay= deltay-LLY
      if(rij2< 9.0*hgas*hgas) then
!      fx(ip)=fx(ip)-1
      write(7,*) ip, xpos(ip),ypos(ip), fx(ip), fy(ip)
      subroutine posinit()
      integer :: ii,jj,ip
      open(unit=5, file='position.txt')
      do ip=1,Ngas
      jj=mod(ip-1, sqgridy)
      write(5,*) ip, xpos(ip), ypos(ip)
      end subroutine
      end program
      double precision function qfd(q, cnst)
      double precision :: q, cnst
      double precision :: a1, a2, a3
      if(q<=3.0) then
      if(q<=2.0) then
      if(q<=1.0) then

The point is not that these computations do not sum to zero. They can’t, with finite precision floating-point arithmetic. Instead, the results basically represent the accumulated rounding error.

The point is that a sequence of IEEE-754 compliant floating-point operations should deliver the same results on any platform given the same inputs. I am quite confident that Intel Fortran with /fp:strict delivers appropriate reference results. The results are invariant to optimization level, and inspection of the generated assembly code shows SSE double-precision code that faithfully implements the computation as written.

Changing the type of the literal constant for just parameter like so:

!      double precision, parameter :: cstd=560.6843136844119044 
      double precision, parameter :: cstd=560.684313684411d0

changes many of the outputs in forces.txt, including cases like this that involve a sign change:

>   3.353150646034919E-009  9.385985322296619E-009
<  -3.202870857421658E-009 -9.618815965950489E-009

Similarly, making this change:

!      double precision, parameter :: hgas=0.005d, insp=hgas/1.2d
      double precision, parameter :: hgas=0.005d0, insp=hgas/1.2d0

leads to differences like this in forces.txt:

>          37  2.083333203982981E-003  0.000000000000000E+000
 -7.858034223318100E-008  1.392327249050140E-007
<          37  2.083333333333333E-003  0.000000000000000E+000
  8.847564458847046E-009  0.000000000000000E+000

I would assume that the double-precision literals are what is intended here.

I understand that. I thought that I could get rid of the numerical error. So i suppose I can never make it disappear or get to lower level without having higher precision decreasing the two large constants (cg and cstd).