Different results between OpenMP and single core

Hi All,

I have an optimization code (written in Fortran) that implements a genetic algorithm. Given input data, the code defines a “population” of models, each with a vector of fitting parameters. The “fitness” of each model is computed, and the models are ranked according to fitness. Based on this ranking, new models are “bred”. The fitness of the new population is computed, and the process repeats.

The loop that computes the fitness of each model can be run in parallel, and I have done so using OpenMP; a stripped-down version of the loop is shown below.
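
Here is a self-contained sketch of what that loop looks like (the chi^2 expression here is just a stand-in for my actual fitness routine, and the array names are simplified):

program fitness_demo
   implicit none
   integer, parameter :: npop = 100, npar = 5
   integer :: i
   real(8) :: params(npar, npop), chi2(npop), fitness(npop)
   call random_number(params)
   ! Each model is evaluated independently, so the loop parallelizes
   ! cleanly; in the real code the body compares model i to the data.
   !$OMP PARALLEL DO PRIVATE(i)
   do i = 1, npop
      chi2(i) = sum((params(:, i) - 0.5d0)**2)   ! stand-in chi^2
      fitness(i) = 1.0d0 / chi2(i)
   end do
   !$OMP END PARALLEL DO
   print '(99(1pe23.15,2x))', fitness(1), fitness(2)
end program fitness_demo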

I have discovered that I don’t get identical results between the code compiled in the normal way and run on a single CPU, and the code compiled with the -mp flag and run on multiple cores.

After each generation, the various arrays are written to a file, using this format:

FORMAT(99(1pe23.15,2x))

The last digit is probably flopping in the breeze a bit, given the limits of double precision. After the first generation, the output files are nearly the same. The ndiff utility shows:

210c210
<   5.369572146596790E+06    1.862345774856187E-07
--- field 1	relative error 2.23e-15
>   5.369572146596778E+06    1.862345774856192E-07
265c265
<   2.415612679943479E+07    4.139736507855217E-08
--- field 1	relative error 1.24e-15
>   2.415612679943482E+07    4.139736507855212E-08
299c299
<   4.228869572689122E+05    2.364698136963595E-06
--- field 1	relative error 1.18e-14
>   4.228869572689172E+05    2.364698136963567E-06
### Maximum relative error in matching lines = 9.81e-16 at line 245 field 1

The first column is the chi^2, and the second column is the fitness, which is 1.0d0/chi^2. In this run I have 100 models, so three of them had different output between the OpenMP version and the serial version. In two cases, the relative differences are a few times 1E-15, which is about what one can expect from double precision (machine epsilon is about 2.2E-16). However, the third one differs by a little over 1E-14. So somehow a digit was lost?

A generation or two later the results really start to diverge:

202c202
<   4.228869572689122E+05    2.364698136963595E-06
--- field 1	relative error 1.18e-14
>   4.228869572689172E+05    2.364698136963567E-06
226c226
<   9.801858231124967E+06    1.020214714822731E-07
--- field 1	relative error 1.93e-15
>   9.801858231124986E+06    1.020214714822729E-07
230c230
<   9.391013732425959E+03    1.064847766697571E-04
--- field 1	relative error 1.17e-15
>   9.391013732425970E+03    1.064847766697569E-04
232c232
<   9.697123939406827E+06    1.031233596939228E-07
--- field 1	relative error 1.95e-15
>   9.697123939406846E+06    1.031233596939226E-07
236c236
<   1.610902609306161E+07    6.207699920671891E-08
--- field 1	relative error 1.24e-15
>   1.610902609306163E+07    6.207699920671884E-08
247c247
<   5.369578966088783E+06    1.862343409633107E-07
--- field 1	relative error 1.67e-15
>   5.369578966088774E+06    1.862343409633111E-07
257c257
<   4.118674389812605E+05    2.427965664082270E-06
--- field 1	relative error 1.89e-14
>   4.118674389812527E+05    2.427965664082316E-06
269c269
<   2.414748317281700E+07    4.141218332541204E-08
--- field 1	relative error 1.24e-15
>   2.414748317281697E+07    4.141218332541209E-08
285c285
<   1.863263591365901E+07    5.366927173556431E-08
--- field 1	relative error 1.07e-15
>   1.863263591365903E+07    5.366927173556424E-08
294c294
<   3.648716487121342E+07    2.740689783735296E-08
--- field 1	relative error 1.09e-15
>   3.648716487121346E+07    2.740689783735293E-08
307c307
<         33        31
--- field 2	relative error 3.33e-02
>         33        30
370c370
<          7        13
--- field 1	relative error 1.32e+01
>        100        13
371c371
<        100        56
--- field 1	relative error 1.32e+01
>          7        56
400c400
<          1        30
--- field 2	relative error 3.33e-02
>          1        31
### Maximum relative error in matching lines = 9.16e-16 at line 225 field 1

There are more fitness values with issues in the last digit. In addition, we see differences in the rankings. The two integer columns are the model index and its ranking. In the multicore run, the model at index 33 was ranked #31, but the same model in the serial run was ranked #30. It turns out that these models have very similar chi^2 values, so the loss of one or two digits can cause the rankings to flip. Once the rankings flip, the genetic algorithm proceeds in a different way, and the differences grow rapidly after that.
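
To make the flip concrete, here is a toy example using the two chi^2 values at line 202 above, plus an invented value for a second model that happens to sit between them:

program rankflip
   implicit none
   real(8) :: a_serial, a_omp, b
   a_serial = 4.228869572689122d+05   ! model A, serial run
   a_omp    = 4.228869572689172d+05   ! model A, OpenMP run
   b        = 4.228869572689150d+05   ! model B (invented), same in both runs
   ! Lower chi^2 means better fitness, so A outranks B in the serial
   ! run but B outranks A in the OpenMP run:
   print *, a_serial < b   ! T
   print *, a_omp < b      ! F
end program rankflip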

Note that the two codes were compiled in exactly the same way, except that the -mp flag is included to use OpenMP. I have tried the -O2, -O3, and -fast optimizer flags, and I consistently see slight differences in the fitness values. The same files also contain the parameter arrays, and those are identical in the earlier generations.

As I understand it, variables can be put either on the “heap” or on the “stack” depending on whether OpenMP is used. Why should this matter for the precision of the results? I am using PGI version 16.5, and also version 18.4 on a different machine (both Intel Xeons of some sort). Is there a compiler flag or two that I can try to minimize or eliminate this behavior?

Jerry

Hi Jerry,

Given that the order of operations can differ when an application is run in parallel (floating-point operations are not associative), it’s not out of the ordinary to see slight differences in the generated values. This is especially true if your code uses reductions, which will give slightly divergent results when run sequentially versus in parallel.
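
Here’s a minimal demonstration (compile with -mp): in the reduction below, each thread accumulates a partial sum, and the partial sums are combined in an unspecified order, so the last bits of the total can differ from the sequential result.

program sumdemo
   implicit none
   integer, parameter :: n = 1000000
   integer :: i
   real(8) :: x(n), s
   call random_number(x)
   s = 0.0d0
   !$OMP PARALLEL DO REDUCTION(+:s)
   do i = 1, n
      s = s + x(i)
   end do
   !$OMP END PARALLEL DO
   print '(1pe23.15)', s
end program sumdemo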

Is there a compiler flag or two that I can try to minimize or eliminate this behavior?

You can try adding “-Kieee” so that the compiler adheres to strict IEEE 754 compliance, but this won’t help with the order of operations.
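
For example, the OpenMP build line would become something like the following (the source file name is just a placeholder):

pgfortran -O2 -mp -Kieee mycode.f90 -o mycode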

You may need to determine where the numerical divergence occurs, and then either not run the numerically sensitive loops in parallel or rework your algorithms; one example of such a rework is sketched below.
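
If the divergence traces back to an accumulation, one common rework is to give each fixed chunk of the data its own accumulator and then combine the chunk sums in a fixed order. A rough sketch (the chunk size is just illustrative):

program fixed_order_sum
   implicit none
   integer, parameter :: n = 1000000, chunk = 10000
   integer, parameter :: nchunks = (n + chunk - 1) / chunk
   integer :: i, c
   real(8) :: x(n), partial(nchunks), s
   call random_number(x)
   partial = 0.0d0
   ! Each chunk is summed by exactly one thread, in a fixed order...
   !$OMP PARALLEL DO PRIVATE(i)
   do c = 1, nchunks
      do i = (c-1)*chunk + 1, min(c*chunk, n)
         partial(c) = partial(c) + x(i)
      end do
   end do
   !$OMP END PARALLEL DO
   ! ...and the chunk sums are combined sequentially, so the result is
   ! reproducible for any thread count (though it still need not match
   ! the original single-loop sequential sum bit for bit).
   s = 0.0d0
   do c = 1, nchunks
      s = s + partial(c)
   end do
   print '(1pe23.15)', s
end program fixed_order_sum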

To help with this, you can give our new PCAST (PGI Compiler Assisted Software Testing) routines a try. PCAST allows you to instrument your code and then compare results with previous runs to determine where the numerical divergence occurs.

For more information please see: Detecting Divergence Using PCAST to Compare GPU to CPU Results | NVIDIA Technical Blog

-Mat

Hi Mat,

Thanks for the reply. I found two uninitialized variables and fixed them; they were not the cause of the differences.

For what it is worth, the Intel compiler shows similar behavior in the end. In the generations leading up to the big divergence, the maximum differences in the fitness are on the order of 9E-16. However, at the same point, the ranking gets slightly tweaked, and it goes downhill from there.

These very small differences in the fitness are of course insignificant. However, it is more the principle of the thing: given the same data, the code should run the same way on one core, two cores, etc.

Thanks for the pointer to the PCAST code. Something like that looks like it would be really useful for me. The post you pointed me to says it is available in the 18.7 release, so the Community Edition 18.4 won’t work for this tool?

Thanks,

Jerry

so the Community Edition 18.4 won’t work for this tool?

Not for 18.4, but it is in 18.10, the next Community Edition, which will be out shortly.

For what it is worth, the Intel compiler shows a similar behavior in the end.

Not unexpected. This issue is inherent to running operations in parallel and is not compiler-specific.

-Mat