Accuracy, Inlining, and O Levels

I’m currently trying to inline (and soon reloop) a large piece of code that will eventually run on GPU accelerators. Until then, I am working solely on CPUs. My current attempt at this inlining, which involves about 20 subroutines, has hit a possible roadblock: I seem to have lost accuracy.

To explain this, what I did was build a driver that runs two sets of calculations. It first runs the code in its full-of-calls, non-inlined, original glory. The various output arrays are then put into control arrays:


I then reinitialize everything and run the new inlined code, and then make an array that contains the absolute diffs between the new and old results:
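
A minimal sketch of these two steps, assuming hypothetical names flc and flc_control (only flc_diff appears in the check that follows), a made-up array shape, and stand-in values in place of the real calculation:

```fortran
program compare_sketch
   implicit none
   integer, parameter :: n = 4, m = 3     ! hypothetical array shape
   real :: flc(n,m)                       ! output array (name assumed)
   real :: flc_control(n,m)               ! saved reference copy (name assumed)
   real :: flc_diff(n,m)                  ! absolute differences

   ! Stand-in for the original, non-inlined run
   flc = 1.0
   flc_control = flc                      ! keep the reference results

   ! Stand-in for reinitializing and running the inlined code
   flc = 1.0
   flc(2,3) = flc(2,3) + 1.0e-03          ! deliberately planted discrepancy

   ! Element-wise absolute difference between new and old results
   flc_diff = abs(flc - flc_control)

   print *, "largest difference:", maxval(flc_diff)
   print *, "located at:        ", maxloc(flc_diff)
end program compare_sketch
```

The planted discrepancy is only there so the sketch has something to report; in the real driver both runs come from the actual code paths.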


Finally, I check to see if the resultant difference array is within a threshold value (in this case 1.e-08):

if (maxval(flc_diff) > thresh) then
   write (output_unit,*) "Failure with flc!"
   write (output_unit,*) maxval(flc_diff)
   write (output_unit,*) maxloc(flc_diff)
end if

What I’ve found is that using compile options of:

FOPTS = -O0 -Kieee -r4 -Mextend -Mpreprocess -Ktrap=fp

I’m getting outputs of:

 Failure with flc!
         1493           14

with this value being the largest absolute difference I’ve seen.

This problem only cropped up after I inlined the very last subroutine call. Before this, I was getting under-threshold accuracy with even “-fast -Kieee”. I am certain I’m not stepping on any variables (some renaming was needed, but I’ve confirmed the renamed variables work in the non-inlined case with no loss of accuracy).

I suppose my question is, should I expect better accuracy than this? I don’t know how inlining code would cause more roundoff error than not inlining. I was expecting bit-identical results from just inlining at -O0 before I started changing the loop order.

Is there any way to get even less optimized and more accurate than “-O0 -Kieee”? Or, did I just coincidentally gather enough roundoff error with this last inline such that it makes a difference?
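
For a sense of scale on the roundoff question, here is a small sketch, assuming the arrays being compared are single precision (real*4). It only prints the machine epsilon and the gap between neighbouring representable values at two hypothetical magnitudes; it says nothing about the actual data in the code.

```fortran
program real4_scale
   use, intrinsic :: iso_fortran_env, only: real32
   implicit none
   real(real32) :: x

   ! Machine epsilon for single precision (2**(-23))
   print *, "epsilon:           ", epsilon(1.0_real32)

   ! Gap between adjacent representable values near 1.0
   x = 1.0_real32
   print *, "spacing near 1:    ", spacing(x)

   ! Gap between adjacent representable values near 1.0e-3
   x = 1.0e-3_real32
   print *, "spacing near 1e-3: ", spacing(x)
end program real4_scale
```

If the values in flc are of order one, a single last-bit difference is already larger than a 1.e-08 absolute threshold, so passing that threshold would mean the results are either bit-identical or small in magnitude.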


Hi Matt,

“-O0 -Kieee” means that the compiler is doing no optimization and using strict IEEE 754 conforming intrinsics. In other words, it’s most likely not an issue with precision. I would go back and recheck the code you inlined and look for possible coding errors.

Hope this helps,

Hmm. I’ll see what I can do, but you might be getting an email soon. I just cannot see how this isn’t working.

I did attempt to add a previously re-looped version of this subroutine instead of the straight cut-and-paste, just to see if that changed anything…and it core dumped. No error message or anything, just a straight core dump!

A-yup. Somehow some temporary array was being passed from one inlined section to another. I did the “make old engineers cry” method of duplicating every single temporary array with different names and it now does work to 1e-12 precision even with “-fast -Kieee”. (Whether or not real*4 can be that precise, well…)
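
A minimal sketch of how this kind of clash can arise, using made-up names (tmp, tmp_b, a, b) and made-up arithmetic rather than the real code: a caller and a routine each use a temporary called tmp, which is harmless while the routine has its own local copy but not once its body is pasted into the caller.

```fortran
program temp_clobber
   implicit none
   integer, parameter :: n = 3
   real :: a(n), b(n), tmp(n), tmp_b(n)
   real :: res_shared(n), res_renamed(n)

   a = [1.0, 2.0, 3.0]

   ! Version 1: the inlined body reuses the caller's temporary name
   tmp = 2.0 * a              ! caller fills its temporary
   tmp = a + 100.0            ! inlined routine body overwrites it
   b = 0.5 * tmp
   res_shared = tmp + b       ! caller expected its own tmp here

   ! Version 2: the inlined body gets its own, renamed temporary
   tmp = 2.0 * a              ! caller fills its temporary
   tmp_b = a + 100.0          ! inlined routine body uses tmp_b instead
   b = 0.5 * tmp_b
   res_renamed = tmp + b      ! caller's tmp is intact

   print *, "shared temporary:  ", res_shared
   print *, "renamed temporary: ", res_renamed
   print *, "max abs difference:", maxval(abs(res_shared - res_renamed))
end program temp_clobber
```

Giving every inlined section its own temporaries costs memory, but it restores the isolation that the subroutine boundaries used to provide.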

I’m still not sure about the core dump, but I’m shelving that for now as I have some massive relooping coming up.

Thanks again, Mat, and apologies for wasting some SQL database entries!

Not a problem. I’d much rather have you ask than not. I’m just glad you were able to find the error.

- Mat