Results depend on compiled binary location

I’m having a problem with a program where the results I get are dependent on the location of the program. The binary files are exactly the same, just copied from one path to the other. There is in principle nothing that should cause this, but the program is a messy monster, quite hard to debug (and I haven’t been able to create a simple test case so far).

The differences in the results are usually small, if any, but sometimes enough to cause the internal stability tests to fail. So far, I have only detected these differences with optimization level 2 or higher, and only with the PGI compilers (pgfortran in particular), so I suspect this may be a compiler bug or misconfiguration.

Apparently, when I have the program in a path with a short length, it works fine (meaning the results agree with what I get with other compilers), but if I copy the program to a longer path, sometimes the results are different.
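To check the path-length dependence systematically, the same binary can be copied to destinations whose full path has a chosen number of characters. The following is only a minimal sketch: the helper name `copy_to_path_of_length` and its parameters are hypothetical, and it assumes a single padding directory is enough (so the padding must fit in one directory name).

```python
import os
import shutil


def copy_to_path_of_length(binary, base_dir, target_len, filler="x"):
    """Copy `binary` below `base_dir` so the destination path has `target_len` characters.

    Hypothetical helper: the layout is base_dir/<padding>/<binary name>.
    """
    name = os.path.basename(binary)
    # Two path separators are added: base_dir/<padding>/<name>
    pad = target_len - len(base_dir) - len(name) - 2
    if pad < 1 or pad > 200:
        raise ValueError("target_len cannot be reached with one padding directory")
    dest_dir = os.path.join(base_dir, filler * pad)
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, name)
    shutil.copy2(binary, dest)  # keeps permissions, so the copy stays executable
    return dest
```

Running the copies produced for a range of `target_len` values and comparing their outputs would show whether the failures follow a pattern in the path length.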

Obviously, this is not enough information to solve the problem. But I was wondering if there is any known issue that could be causing this, or if anyone has encountered similar problems. I’m using pgi 13.7-0.

Hi Ignacio Fdez. Galvan,

My best guess is that there’s some type of memory issue with the program (like an uninitialized memory read, UMR) and where the binary is located perturbs this just enough to cause the errors to appear. Granted, I have no way of knowing if this is indeed the case, but whenever I encounter odd behavior that doesn’t make sense, it’s the first thing I check.

Can you please try running your program under Valgrind (www.valgrind.org) and see if it finds any memory problems? I’d try it a couple of different ways, with and without optimization, and in both locations.

It’s possible that there’s a compiler optimization issue or there could be a subtle program issue that only gets exposed with a particular optimization. Let’s see if Valgrind finds anything, then go from there.

- Mat

Thank you for your suggestion. I have been trying to track down this problem and it is quite elusive. I tried using valgrind, but it complains about an unrecognized instruction. When I compile with -tp=x64 valgrind doesn’t complain, but then the bug does not appear. That’s a hint.

Another hint. In the full program, the first place I could find where the problem appears was just a LAPACK call:

call dsygv(1,'V','L',n,Tr,n,Bk,n,Work(iW),Work(itmp),lwork,info)

I checked that all the arguments are exactly the same on input, but the output is different depending on where the executable sits (I checked by writing to unformatted files and comparing those). This is with a stock LAPACK/BLAS suite, freshly downloaded from http://www.netlib.org/ and compiled with pgfortran. However, I have been so far unable to create a stand-alone minimal test.
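For comparing two such dumps element by element, something along these lines can pinpoint which values differ. This is only a sketch: the function name `diff_doubles` is hypothetical, and it assumes each dump is a raw stream of little-endian 8-byte reals (for example written with stream access, or with the record markers of a sequential unformatted file already stripped).

```python
import struct


def diff_doubles(a: bytes, b: bytes):
    """Return a list of (index, value_a, value_b) for 8-byte values that differ bitwise.

    Assumes both inputs are raw little-endian float64 streams of equal length.
    """
    if len(a) != len(b) or len(a) % 8 != 0:
        raise ValueError("inputs must have the same length, a multiple of 8 bytes")
    diffs = []
    for i in range(len(a) // 8):
        chunk_a = a[8 * i:8 * i + 8]
        chunk_b = b[8 * i:8 * i + 8]
        # Compare the raw bytes so that even last-bit differences are reported
        if chunk_a != chunk_b:
            x = struct.unpack("<d", chunk_a)[0]
            y = struct.unpack("<d", chunk_b)[0]
            diffs.append((i, x, y))
    return diffs
```

Typical use would be to read both files with `open(path, "rb").read()` and pass the contents in; an empty list means the dumps are bitwise identical.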

Final hint. The bug disappears if I compile just the BLAS routines with -O0 (everything else with -O2).

Any suggestion and help for further debugging this would be appreciated.

Hi Ignacio Fdez. Galvan,

From what you describe, I’d say AVX or FMA instructions could be the issue. However, it makes no sense that moving the binary’s location would cause this.

A couple of things to try:

  1. Use the BLAS libraries that ship with the compilers, i.e. “-lblas” or “-lacml”.

  2. Add “-Mnofma” and/or “-Mvect=nosimd” to your compile flags.

Note that the valgrind error occurs because it doesn’t understand AVX instructions. Newer versions of valgrind have been updated to support AVX.

- Mat

Hi Mat/Ignacio Fdez. Galvan,

In my experience, even the latest version of valgrind (3.9.0) isn’t happy with all sandybridge instructions and I’ve found that compiling with -tp=nehalem-64 and running on a sandybridge machine allows valgrind to work while still using AVX. @Ignacio Fdez. Galvan: hopefully -tp=nehalem-64 will keep the bug and allow valgrind to work.

Hope this helps,
Kyle

I guess the process is accessing memory out of bounds, and that memory happens to contain some path-dependent information. It’s probably some piece of memory initialized by the rest of my program.

With “-lacml” I don’t see the problem, which makes sense if the library is correctly pre-compiled, I assume.

Only “-Mvect=nosimd” (or “-Mvect=simd:128”) got rid of the bug.
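As background on why the vector width can matter at all: floating-point addition is not associative, so a reduction accumulated in several SIMD lanes can round differently from the same reduction done sequentially. The sketch below only illustrates that general effect; the function `lane_sum` is a hypothetical model of a vectorized sum, not what pgfortran actually generates, and nothing in this thread establishes that this is the cause here.

```python
def lane_sum(values, lanes):
    """Sum `values` using `lanes` interleaved partial sums, then combine them.

    lanes=1 models a plain sequential loop; lanes=2 or 4 models a
    vectorized reduction with that many accumulators (an illustration only).
    """
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:
        total += p
    return total


values = [1e17, 1.0, -1e17, 1.0]
print(lane_sum(values, 1))  # sequential order: the first 1.0 is absorbed by 1e17
print(lane_sum(values, 2))  # two lanes: the large terms cancel before the small ones are added
```

The two calls return different values for the same input, purely because of the order in which the roundings happen.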

No luck with -tp=nehalem-64. Of the different targets listed in the manual, only sandybridge-64 reproduced the bug, and it doesn’t play nicely with valgrind.

By the way, I forgot to mention before that since the first message I upgraded to 14.7, so all this is with the latest pgfortran version.