I’m having a problem with a program where the results I get are dependent on the location of the program. The binary files are exactly the same, just copied from one path to the other. There is in principle nothing that should cause this, but the program is a messy monster, quite hard to debug (and I haven’t been able to create a simple test case so far).
The differences in the results are usually small, if any, but sometimes enough to cause the internal stability tests to fail. So far, I have only detected these differences with optimization level 2 or higher, and only with the PGI compilers (pgfortran in particular), so I suspect this may be a compiler bug or misconfiguration.
Apparently, when I have the program in a path with a short length, it works fine (meaning the results agree with what I get with other compilers), but if I copy the program to a longer path, sometimes the results are different.
Obviously, this is not enough information to solve the problem. But I was wondering if there is any known issue that could be causing this, or if anyone has encountered similar problems. I’m using PGI 13.7-0.
My best guess is that there’s some type of memory issue with the program (like a UMR, an uninitialized memory read) and that where the binary is located perturbs this just enough to cause the errors to appear. Granted, I have no way of knowing if this is indeed the case, but whenever I encounter odd behavior that doesn’t make sense, it’s the first thing I check.
Can you please try running your program under Valgrind (www.valgrind.org) and see if it finds any memory problems? I’d try it a couple of different ways, with and without optimization, and in both locations.
It’s possible that there’s a compiler optimization issue, or there could be a subtle program issue that only gets exposed with a particular optimization. Let’s see if Valgrind finds anything, then go from there.
Thank you for your suggestion. I have been trying to track down this problem and it is quite elusive. I tried using valgrind, but it complains about an unrecognized instruction. When I compile with -tp=x64 valgrind doesn’t complain, but then the bug does not appear. That’s a hint.
Another hint. In the full program, the first place I could find where the problem appears was just a LAPACK call:
I checked that all the arguments are exactly the same on input, but the output is different depending on where the executable sits (I checked by writing to unformatted files and comparing those). This is with a stock LAPACK/BLAS suite, freshly downloaded from http://www.netlib.org/ and compiled with pgfortran. However, I have been so far unable to create a stand-alone minimal test.
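For anyone wanting to repeat the comparison of the unformatted dumps, here is a minimal sketch in Python. It rests on assumptions that are not stated in this thread: that the files are sequential unformatted files with 4-byte little-endian record-length markers before and after each record, and that every record holds only double precision values. The names `read_records` and `compare_dumps` are hypothetical helpers made up for illustration.

```python
import struct


def read_records(path):
    """Read all records of a sequential unformatted file.

    Assumes each record is framed by a 4-byte little-endian length
    marker before and after the payload (an assumption about the
    compiler's file layout, not something confirmed in this thread).
    """
    records = []
    with open(path, "rb") as f:
        while True:
            head = f.read(4)
            if not head:
                break  # clean end of file
            (nbytes,) = struct.unpack("<i", head)
            payload = f.read(nbytes)
            (tail,) = struct.unpack("<i", f.read(4))
            if tail != nbytes:
                raise ValueError("record markers disagree; layout assumption is wrong")
            records.append(payload)
    return records


def compare_dumps(path_a, path_b):
    """Return (record, element, value_a, value_b) for every float64 that differs bitwise."""
    recs_a = read_records(path_a)
    recs_b = read_records(path_b)
    if len(recs_a) != len(recs_b):
        raise ValueError("files contain a different number of records")
    diffs = []
    for r, (ra, rb) in enumerate(zip(recs_a, recs_b)):
        if len(ra) != len(rb):
            raise ValueError("record %d has different lengths" % r)
        if ra == rb:
            continue  # identical bytes, nothing to report
        # Compare the raw 8-byte chunks so that even NaNs are compared bit for bit.
        for i in range(len(ra) // 8):
            chunk_a = ra[8 * i:8 * i + 8]
            chunk_b = rb[8 * i:8 * i + 8]
            if chunk_a != chunk_b:
                (va,) = struct.unpack("<d", chunk_a)
                (vb,) = struct.unpack("<d", chunk_b)
                diffs.append((r, i, va, vb))
    return diffs
```

Calling `compare_dumps` on the input dumps from the two locations should return an empty list if the arguments really are bitwise identical, while the output dumps would list exactly which elements differ and by how much.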
Final hint. The bug disappears if I compile just the BLAS routines with -O0 (everything else with -O2).
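One possibility that would fit these hints, offered only as an assumption and not as a diagnosis: optimized (vectorized) BLAS loops may accumulate sums in a different order than unoptimized ones, and floating-point addition is not associative, so the same inputs can give slightly different outputs. The sketch below illustrates just that effect in Python; `dot_sequential` and `dot_two_lanes` are hypothetical names, and the two-lane loop is only a stand-in for what a vectorizing compiler might generate.

```python
def dot_sequential(x, y):
    """Dot product accumulated strictly left to right, like an unoptimized loop."""
    total = 0.0
    for a, b in zip(x, y):
        total += a * b
    return total


def dot_two_lanes(x, y):
    """Dot product with two partial sums (even and odd indices).

    This mimics, as an assumption, how a vectorized loop may keep one
    accumulator per vector lane and combine them at the end.
    """
    lanes = [0.0, 0.0]
    for i, (a, b) in enumerate(zip(x, y)):
        lanes[i % 2] += a * b
    return lanes[0] + lanes[1]


# Deliberately ill-conditioned input: large terms cancel, small terms remain.
x = [1e16, 1.0, -1e16, 1.0]
y = [1.0, 1.0, 1.0, 1.0]

print(dot_sequential(x, y))  # the first 1.0 is lost when added to 1e16
print(dot_two_lanes(x, y))   # the large terms cancel within their own lane
```

The two functions return different values for this input even though both are valid ways to evaluate the same sum. Whether anything like this is actually happening in the BLAS routines here is not established by the information in this thread.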
Any suggestions or help for debugging this further would be appreciated.
In my experience, even the latest version of valgrind (3.9.0) isn’t happy with all sandybridge instructions and I’ve found that compiling with -tp=nehalem-64 and running on a sandybridge machine allows valgrind to work while still using AVX. @Ignacio Fdez. Galvan: hopefully -tp=nehalem-64 will keep the bug and allow valgrind to work.