Results do not match when comparing two runs using different EXEs

Hello,
I have a Fortran (77/90) code that is originally MPI, built specifically for 32 MPI tasks, and I have added OpenMP constructs. Because the code is predominantly F77 in the subroutine concerned, the OpenMP directives use the fixed-form sentinel “C$OMP”.
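For anyone unfamiliar with the fixed-form sentinel, the directives look something like this (a made-up fragment for illustration, not taken from the actual code):

      SUBROUTINE UPDATE(N, A, B)
      INTEGER N, I
      REAL A(N), B(N)
C$OMP PARALLEL DO PRIVATE(I)
      DO 10 I = 1, N
         A(I) = A(I) + B(I)
   10 CONTINUE
C$OMP END PARALLEL DO
      RETURN
      END

Without -mp the compiler treats the C$OMP lines as ordinary comments, so the same source is meant to build both the pure-MPI and the hybrid versions.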

To confirm that the code has not changed, I ran the same case with OMP_NUM_THREADS=1, compiled at -O0 (the same as the pure MPI version) and with -mp=nonuma. The compiler is 9.0.4 on a Cray XT (Barcelona quad-cores).

There is a comparator tool that shows there are differences in the results.

Then I disabled the OpenMP directives by putting "CCC " in the first four columns, re-compiled, and again saw discrepancies. I am unhappy about that.
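Concretely, overwriting the first four columns of the sentinel leaves an ordinary F77 comment that even an -mp compile should ignore, e.g. (same made-up loop as above):

CCC P PARALLEL DO PRIVATE(I)
      DO 10 I = 1, N
         A(I) = A(I) + B(I)
   10 CONTINUE
CCC P END PARALLEL DO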

Next I omitted “-mp=nonuma” and the results were identical, presumably because the code is compiled essentially identically at that point. This is puzzling, mainly because I do not understand what the -mp option does when the OpenMP directives are disabled in the code.

The code is large (>80,000 lines in one file and 330 subroutines), but I am only attempting OpenMP in one subroutine at a time (being cautious).

I had wound the optimization down from “-fast” to reduce the complexity of the issue, but it still seems to be a problem. How am I supposed to have confidence in a hybrid code if I cannot get a single-threaded version to agree with the “standard” MPI run? (Eventually I expect to run 64 MPI tasks with as many as 24 OpenMP threads on the next generation of the machine.)

Any ideas or clarification about -mp=nonuma would be appreciated.

Dr. Skids

Hi Dr. Skids,

This is odd behavior. My best guess is that you have a UMR (uninitialized memory reference).

When compiling with “-mp”, with or without OpenMP directives in the source, all local variables are allocated on the stack. This is required in order to make the variables thread-safe.

I suspect a UMR because variables on the stack are more likely to contain garbage. It doesn’t fully explain why a single-threaded run fails to match your serial run, but UMRs introduce non-deterministic behavior and it could just be luck. If you can, please try running your application with Valgrind (www.valgrind.org). I find it very useful for finding UMRs.
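As a made-up illustration of the kind of pattern I mean (none of these names are from your code):

      SUBROUTINE ACCUM(N, X, TOTAL)
      INTEGER N, I
      REAL X(N), TOTAL, ACC
C     ACC is never initialized.  When locals are in static storage
C     it often happens to start out as 0.0, so the answer looks
C     right.  With -mp, ACC lives on the stack and starts out as
C     whatever garbage happens to be there, and the results change.
      DO 20 I = 1, N
         ACC = ACC + X(I)
   20 CONTINUE
      TOTAL = ACC
      RETURN
      END

Valgrind's memcheck tool tracks uninitialized values like ACC and reports them once they affect program behavior, and with --track-origins=yes it points back to where the bad value came from.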

Hope this helps,
Mat

Hi Mat,
Thanks for the reply. I knew it might be confusing, so here is what I hope is a clearer summary of the cases:

(A) MPI-only code (set for 32 tasks): my reference run
(B) Hybrid code with the OpenMP disabled using traditional F77 comments, compiled without “-mp=nonuma”
(C) Same code as (B) but compiled with “-mp=nonuma”
(D) Hybrid code with the OpenMP directives enabled in the source but not activated at compile time (no “-mp=nonuma”)
(E) As (D) but compiled with “-mp=nonuma”

All these with -O0 for simplicity.

Well, I did an additional test of the reference code (A) compiled with “-mp=nonuma”, and the results differ from the reference. So it is truly the choice of compiler flag that leads to the different results.

I should add that this legacy code has been evolved elsewhere since 1988, hence the mixture of F77 and F90 structures. I agree that it is likely to be a UMR, although all arrays are “static”. I have unearthed several coding errors over the last 12 months. However, my task is only to enhance the performance and I have little knowledge of its scientific basis.
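To show why I still think a UMR is plausible despite the static arrays, here is a made-up fragment (not from the code): an array kept in COMMON stays in static storage whatever flags are used, but a local scalar alongside it moves to the stack under -mp and picks up garbage if it is read before it is set.

      SUBROUTINE STEP(N, RESULT)
      INTEGER N, I
      REAL RESULT
C     WORK is "static": it lives in a COMMON block regardless of -mp.
      REAL WORK(1000)
      COMMON /WRKSPC/ WORK
C     SCALE is a local scalar: static without -mp, on the stack with
C     -mp.  Reading it before it is set is the suspected UMR.
      REAL SCALE
      DO 30 I = 1, N
         WORK(I) = WORK(I)*SCALE
   30 CONTINUE
      RESULT = WORK(1)
      RETURN
      END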

I will look at Valgrind.
Cheers,

Dr. Skids