PGI compilation/execution problems; works fine on a UNIX system

I am extremely new to the whole Linux/UNIX world and have encountered a problem with an old code. I obtained an old FORTRAN 77 code that has been in use for many years and was asked to load it onto our system at school and run some calculations. When compiling with pgf77 the code compiled just fine, but it would bomb out with a segmentation fault when executing (after reading in an input file and starting to step through the calculation). I ran the debugger and found out where the problem occurred, but everything there looked fine. On a hunch I moved the code over to a UNIX-based system, compiled and executed … where everything ran just fine from beginning to end.

The code uses a lot of the old "COMMON", "DATA", "DIMENSION", and "MEMORY" statements from the era prior to dynamic memory allocation, and I was wondering if there is some inherent difference between Linux and UNIX that is screwing me up (since there were no errors on the UNIX side) … and if so, whether there is an easy (or not so easy) but standard fix to this type of problem. Any thoughts are greatly appreciated; sorry for not knowing more …

~Jack Galloway

Hi Jack,

Let's try something easy first and see if you're getting a stack overflow. Try setting your stack size to "unlimited". The exact syntax depends on your shell:

For bash, use "ulimit -s unlimited"
For tcsh, use "unlimit" or "limit stacksize unlimited"

Another thing to try is to compile with "-Mbounds" to see if you're writing past the end of an array. Different systems lay out memory differently, so writing past the end of an array may be OK on one system but fail on another.
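For example, assuming a single source file (the file and program names here are just placeholders), the compile line would look something like:

    pgf77 -Mbounds -o mycode mycode.f
    ./mycode

With bounds checking enabled, the run should stop with a subscript-out-of-range message instead of silently corrupting memory.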

If these don't work, then you'll need to do some more investigation. Compile your code with "-g" and then run the executable under the PGI debugger, pgdbg. Once you find where the seg fault occurs, set a breakpoint before the offending code and step through the program line by line to see if you can determine what's causing the error.
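Again with placeholder names, that would look roughly like:

    pgf77 -g -o mycode mycode.f
    pgdbg ./mycode

Once you're in the debugger, set a breakpoint at the suspect line, run to it, and step from there while printing the values involved.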

  • Mat

Mat, thanks for the reply. I did both things you suggested, and it returned an error that said "PGFTN-F-Subscript out of range for array a…" along with the corresponding line number.

I have actually run the debugger and found the line where the error occurs. Interestingly, the variable it bombs out on is calculated by dividing two other numbers. Both are real numbers and hold sensible values (I added WRITE statements just before the crash to make sure), which seems to indicate that the variable the calculation is assigned to is fouled up in memory, or something to that effect. One thought: I tried to track where this variable comes from, through the various subroutines where it is used, and it seems to originate in a *.libd file, which is a binary file containing a lot of nuclide information (this code performs nuclear calculations).

Two questions: first, if this array is out of range, what does that mean? Second, if this binary *.libd file was generated on a UNIX system, could it be misformatted for a Beowulf cluster running Linux? (I'm just tossing out guesses.) I tracked the same variables on the UNIX side, where the code executes fine, and at some point the two runs diverge: the Linux numbers become extremely large, whereas on the UNIX side they stay much smaller. Although I'm new at this, I really get the feeling something is fouled up in memory between the two platforms. Thanks again for the help on this.

~Jack

Hi Jack,

While "-Mbounds" can give false positives, you should first investigate what's happening with this array. The check doesn't always work when the dimensions of the array are unknown, so if the lower or upper bound printed in the error message is zero, it's most likely a false positive. If you really are writing past the end of an array, you could be lucky and be writing to memory that is not used (such as padding), or you could be overwriting the value of an important variable such as a pointer. If the overwritten variable were a pointer, that could cause a seg fault later in the program. Memory layout changes between systems and compilers, which would explain why you're seeing the crash on one system but not the other.
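As a concrete illustration (the names below are made up, not taken from your code), here is the kind of silent corruption an out-of-bounds write can cause when an array and another variable share a COMMON block:

C     Hypothetical sketch: the loop writes one element past the end of
C     A, and on a typical sequential COMMON layout A(11) lands on the
C     storage for SCALE, so SCALE is silently set to 0.0.  Compiling
C     with -Mbounds catches the A(11) reference instead.
      PROGRAM CLOBBR
      REAL A(10), SCALE
      INTEGER I
      COMMON /BLK/ A, SCALE
      SCALE = 2.0
      DO 10 I = 1, 11
         A(I) = 0.0
   10 CONTINUE
      WRITE(*,*) 'SCALE = ', SCALE
      END

If something more critical than SCALE happens to follow the array, say an index that is later used to address memory, the same bug can show up as a seg fault far away from the line that caused it.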

I'm not familiar with the file extension ".libd". Is it simply a binary file containing data that is read by your program, or is it a library containing executable code? If it's just a data file, what is the endianness (the order in which the bytes of multi-byte data types are stored) of the UNIX system it was generated on? If that was an IBM mainframe or a Sun SPARC system, it's most likely big endian. PCs use little endian, so reading a file containing big-endian values can produce unexpected results. If this is the case, try recompiling with "-Mbyteswapio" to tell the compiler that you're using big-endian data files.
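For example (file names are placeholders again), rebuilding with the flag is just:

    pgf77 -Mbyteswapio -o mycode mycode.f

This swaps the byte order on all unformatted I/O, so the program can read the big-endian *.libd file unchanged on a little-endian Linux cluster.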

  • Mat

Mat,

You hit the nail on the head with the endianness comment (I didn't know much about this previously). I found that the Sun UNIX platform is big endian and the Linux cluster is little endian, so I added the specifier

CONVERT='BIG_ENDIAN'

to the OPEN statement for the file, which is just a binary data file; the .libd extension simply tells the user that it is a library file (a rough sketch of the full OPEN is below). But you're saying I can do the same thing at the compiler level with the "-Mbyteswapio" option, which is probably simpler and better than modifying existing code. Thanks so much for your help on this. Your guys' compilers are sweet.
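For reference, the OPEN ended up looking roughly like this (the unit number and file name are placeholders, not the real ones from the code):

C     CONVERT='BIG_ENDIAN' is a compiler extension (supported by PGI)
C     that byte-swaps just this one unformatted file.
      OPEN(UNIT=20, FILE='nuclides.libd', FORM='UNFORMATTED',
     +     STATUS='OLD', CONVERT='BIG_ENDIAN')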

~Jack

Hi Jack,

I'm glad you found the problem. Using "CONVERT" is better if you're mixing the endianness of your files or if you want to be explicit about a file's data format. "-Mbyteswapio" converts all unformatted file I/O to big endian, which can be a problem if your output needs to be little endian. However, "-Mbyteswapio" is more portable since no source change is needed, and your binary data will still be compatible with the UNIX system. Use whichever is best for your situation.

Also, thanks for the compliment! We do appreciate it.

  • Mat