compensated summation and compiler optimizations

Hi All,

Many ODE solvers use some form of compensated summation to reduce round-off errors. In my particular case, there is a code segment that looks like this:

         ycsi=ycsi+(temp-yi)     !line 4
         ycs(i)=ycsi+(yi-y(i))     !line 7

The order of operations is important for compensated summation to work. For example, one might algebraically simplify the above expressions and conclude that the term added to ycsi at line 4 is zero, and that ycs(i) at line 7 is zero — true in exact arithmetic, but in floating-point arithmetic those terms are exactly the round-off error being recovered.
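To illustrate why the order of operations matters, here is a minimal self-contained sketch of classic Kahan compensated summation (my own variable names and toy data, not the poster's solver). If a compiler "simplifies" the line computing c, the compensation silently becomes a no-op:

```fortran
program kahan_demo
  implicit none
  integer :: i
  real(kind=8) :: s, c, y, t, naive

  ! Sum 1.0 followed by 10^7 copies of 1.0d-8, naively and with
  ! Kahan compensated summation.
  naive = 1.0d0
  s = 1.0d0
  c = 0.0d0
  do i = 1, 10000000
     naive = naive + 1.0d-8
     y = 1.0d-8 - c     ! correct the addend by the previously lost error
     t = s + y          ! big + small: low-order bits of y are lost here
     c = (t - s) - y    ! algebraically zero; numerically it recovers
     s = t              ! exactly the bits that were just lost
  end do
  print *, 'naive:      ', naive
  print *, 'compensated:', s
end program kahan_demo
```

The compensated result should stay much closer to the exact value 1.1 than the naive one, and the difference is precisely what an over-eager optimizer can destroy by folding `(t - s) - y` to zero.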

We have this ODE solver implemented in Fortran 90 and also Matlab. The Matlab version has slightly smaller long-term errors than the Fortran 90 version (we print out values of conserved quantities, and these numbers drift owing to round-off error). I strongly suspect that the Fortran compiler is tinkering with the code, whereas Matlab does not.

  1. How can I tell the compiler not to tinker with that bit of code? I tried using -O0 at compile time, but that did not seem to make any difference.

  2. Better yet, is there a compiler-independent way to do this? For example, one can use OpenMP directives and many (if not most) Fortran compilers understand these.

  3. I noticed that gfortran has a REAL*10 (KIND=10) option. The machine precision for this option is eps = 0.108420E-18, where 1.0+eps=1.0. My understanding is that somehow they make use of these 80-bit thingamabobs that most Intel CPUs have, and as a result, the speed difference is only about a factor of 2 slower than when using KIND=8. Are there plans to include KIND=10 in the PGI compilers? In the case of our ODE solver, going to KIND=10 beats down the round-off error by roughly a factor of 1000 without a huge decrease in performance.
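For what it's worth, a portable way to request that extended precision (without hard-coding gfortran's KIND=10) is selected_real_kind. This sketch just prints the kind number and machine epsilon the compiler returns; the behavior on non-x86 targets is an assumption:

```fortran
program extended_demo
  implicit none
  ! Ask for at least 18 significant decimal digits. gfortran on x86
  ! maps this to the 80-bit extended type (KIND=10); a compiler with
  ! no such type returns a wider kind (e.g. quad) or -1 if none fits.
  integer, parameter :: ep = selected_real_kind(p=18)
  real(kind=ep) :: x
  print *, 'kind    =', ep
  print *, 'epsilon =', epsilon(x)   ! ~1.08e-19 for the 80-bit type
end program extended_demo
```

If ep comes back as -1, the declaration of x will not compile, so production code should guard the parameter or fall back to a supported kind.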


For question 1, I think you’ll want to look at the volatile attribute.

For question 2: I don’t think there is a way to mark a whole section of code as “do-not-optimize”, at least certainly not in a way that doesn’t involve compiler-specific extensions. Things like OpenMP or OpenACC directives work across most compilers because they are well-known standards, written in such a way that multiple vendors can have their own implementations that work with each other if they follow the rules of the standard. There isn’t such a standard (at least to my knowledge) that gives you fine-grained control over optimization. That kind of thing is left for each vendor to implement as they’d like.

Using -O0 will globally turn off optimizations, but if that didn’t make a difference then I think it is unlikely that adding volatile to the variables will have a noticeable effect.
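In case it helps, here is roughly what the volatile attribute would look like on the compensation variable from your snippet (a sketch; only ycsi comes from your code, the kind is assumed):

```fortran
! VOLATILE tells the compiler the variable may be read or changed
! outside its knowledge, so every reference must actually load or
! store it -- which also prevents folding away the expressions
! that compute it.
real(kind=8), volatile :: ycsi
```

The cost is that every access to ycsi becomes a memory operation, so applying it to anything hot will hurt performance.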

Regarding REAL*10, I’m not sure what our plans are for that; I’ll need to ask around.

What architecture and what compiler version are you using? I’d be pretty surprised if we moved things around at -O0, but it is possible. I dumped the assembly, and it seems to follow your code exactly:


lineno: 4

movslq %r9d, %r10
vmovsd -8(%rsi,%r10,8), %xmm0

lineno: 5

vmovsd (%rcx), %xmm3
vmulsd (%rdx), %xmm3, %xmm3
movslq %r9d, %r10
vaddsd -8(%rdi,%r10,8), %xmm3, %xmm1

lineno: 6

vaddsd %xmm1, %xmm0, %xmm2

lineno: 7

vsubsd %xmm2, %xmm0, %xmm3
vaddsd %xmm1, %xmm3, %xmm1

lineno: 8

vmovsd (%rdx), %xmm3
vmulsd (%r8), %xmm3, %xmm3
vaddsd %xmm1, %xmm3, %xmm1

lineno: 9

vaddsd %xmm1, %xmm2, %xmm3
movslq %r9d, %r10
vmovsd %xmm3, -8(%rsi,%r10,8)

lineno: 10

movslq %r9d, %r10
vsubsd -8(%rsi,%r10,8), %xmm2, %xmm3
vaddsd %xmm1, %xmm3, %xmm3
vmovsd %xmm3, -8(%rdi,%r10,8)

lineno: 11

addl $1, %r9d
subl $1, %eax
testl %eax, %eax
jg .LB1_313

One possibility is that the architecture we default to uses FMA (fused multiply-add) instructions. FMA changes the rounding of a*b+c, so then you might see a difference.
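If FMA contraction turns out to be the culprit, it can usually be disabled per compiler. The flag names below are from memory, so check your compiler’s documentation:

```
# gfortran: forbid contracting a*b+c into a single fused operation
gfortran -O2 -ffp-contract=off solver.f90

# PGI/NVIDIA: disable FMA code generation
pgfortran -O2 -Mnofma solver.f90
```

Comparing the conserved-quantity drift with and without these flags would tell you quickly whether FMA explains the Fortran/Matlab discrepancy.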