I’m currently trying to inline (and soon reloop) a large piece of code that will eventually run on GPU accelerators. Until then, I am working solely on CPUs. My current attempt at this inlining, which involves about 20 subroutines, has hit a possible roadblock: I seem to have lost accuracy.
To explain this, what I did was build a driver that runs two sets of calculations. It first runs the code in its full-of-calls, non-inlined, original glory. The various output arrays are then put into control arrays:
I then reinitialize everything and run the new inlined code, and then make an array that contains the absolute diffs between the new and old results:
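As a rough illustration of the two steps just described, here is a minimal sketch of the compare pattern. Only `flc_diff` appears in my real driver output; `flc`, `flc_control`, the array shape, and the `fill` routine are stand-ins invented for this example, with `fill` standing in for both the original and the inlined code.

```fortran
program compare_sketch
   use, intrinsic :: iso_fortran_env, only: output_unit
   implicit none
   integer, parameter :: ni = 4, nj = 3   ! hypothetical array shape
   real :: flc(ni, nj)                    ! output array (assumed name)
   real :: flc_control(ni, nj)            ! control copy (assumed name)
   real :: flc_diff(ni, nj)               ! absolute differences

   ! First pass: stand-in for the original, non-inlined code.
   call fill(flc)
   flc_control = flc

   ! Reinitialize, then second pass: stand-in for the inlined code.
   flc = 0.0
   call fill(flc)

   ! Element-wise absolute difference between new and old results.
   flc_diff = abs(flc - flc_control)
   write (output_unit, *) maxval(flc_diff)

contains

   ! Hypothetical routine that produces some deterministic data.
   subroutine fill(a)
      real, intent(out) :: a(:, :)
      integer :: i, j
      do j = 1, size(a, 2)
         do i = 1, size(a, 1)
            a(i, j) = real(i + j)
         end do
      end do
   end subroutine fill

end program compare_sketch
```

Because both passes call the same stand-in routine here, the maximum difference should be zero; in the real driver the second pass is the inlined version.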
Finally, I check to see if the resultant difference array is within a threshold value (in this case 1.e-08):
if (maxval(flc_diff) > thresh) then
   write (output_unit,*) "Failure with flc!"
   write (output_unit,*) maxval(flc_diff)
   write (output_unit,*) maxloc(flc_diff)
endif
What I’ve found is that using compile options of:
FOPTS = -O0 -Kieee -r4 -Mextend -Mpreprocess -Ktrap=fp
I’m getting outputs of:
Failure with flc!
2.3841858E-07
1493 14
with this value being the largest absolute difference I’ve seen.
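For scale, 2.3841858E-07 is 2**(-22), which is twice the single-precision machine epsilon of 2**(-23). The sketch below prints the relevant quantities; it assumes default `real` is 32-bit IEEE single precision (as it should be with -r4), and `observed` is just the value copied from the output above.

```fortran
program ulp_sketch
   use, intrinsic :: iso_fortran_env, only: output_unit
   implicit none
   ! Largest difference reported by the driver (copied from the output).
   real, parameter :: observed = 2.3841858e-07

   ! Machine epsilon for default real: the spacing of numbers just above 1.0.
   write (output_unit, *) "epsilon(1.0)   =", epsilon(1.0)

   ! How many epsilons the observed difference amounts to.
   write (output_unit, *) "observed / eps =", observed / epsilon(1.0)

   ! Gap between adjacent representable values near 1.0 and near 2.0.
   write (output_unit, *) "spacing(1.0)   =", spacing(1.0)
   write (output_unit, *) "spacing(2.0)   =", spacing(2.0)
end program ulp_sketch
```

If the values at (1493, 14) have magnitude between 2 and 4, a difference of this size would be a single unit in the last place; I have not checked what the actual values are at that location.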
This problem only cropped up after I inlined the very last subroutine call. Before this, I was getting under-threshold accuracy with even “-fast -Kieee”. I am certain I’m not stepping on any variables (some renaming was needed, but I’ve confirmed the renamed variables work in the non-inlined case with no loss of accuracy).
I suppose my question is, should I expect better accuracy than this? I don’t know how inlining code would cause more roundoff error than not inlining. I was expecting bit-identical results from just inlining at -O0 before I started changing the loop order.
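For reference, this is the kind of effect I would expect once the loop order changes, though not from inlining alone. The sketch below shows that single-precision addition is not associative; the values `a`, `b`, and `c` are made up for illustration, and it assumes default `real` is 32-bit and that intermediates are not held in higher precision.

```fortran
program assoc_sketch
   use, intrinsic :: iso_fortran_env, only: output_unit
   implicit none
   real :: a, b, c, left, right

   ! Made-up values chosen so that c is smaller than the spacing near a.
   a = 1.0e8
   b = -1.0e8
   c = 1.0

   ! Same three operands, two different groupings.
   left  = (a + b) + c   ! a and b cancel first, so c survives
   right = a + (b + c)   ! c is absorbed when added to b

   write (output_unit, *) "(a + b) + c =", left
   write (output_unit, *) "a + (b + c) =", right
end program assoc_sketch
```

With strict single-precision evaluation the two groupings should differ, which is why I only expected bit-identical results before touching the loop order.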
Is there any way to get even less optimized and more accurate than “-O0 -Kieee”? Or, did I just coincidentally gather enough roundoff error with this last inline such that it makes a difference?