Problems with PGF90 on AMD64 X2 platform (memory faults...)

Hi All,

I am having a problem getting my scientific modeling code to run on RHEL 4.1 x86-64. I am using version 6.0-5 of the PGI Fortran compiler.

Some history:

The code crashes with a memory fault when compiled with optimization as a 64-bit application on RHEL 4.1 x86-64.

The code compiles and runs WITHOUT optimization with some (but not all) compiler options as a 64-bit application on RHEL 4.1 x86-64.

The code compiles and runs with and without optimization on 32-bit RHEL V3.5 or V4.1.

The code compiles and runs with and without optimization as a 32-bit application under RHEL 4.1 x86-64. Performance is poor in this case.

Data in memory appears to be corrupted, similar to what might be seen with array out-of-bounds errors, but no array addressing errors were found in the non-optimized 64-bit case. The compiler options and machine characteristics are listed below.

AMD64 X2 - 4600+ - 2 GB RAM
RHEL 4.1 x86-64

pgf90 6.0-5 64-bit target on x86-64 Linux

OPTG= -g -Mlist -Mdclchk -Mdepchk -Mfptrap -Minfo -Mbounds
Program runs apparently ok.

OPTG= -g -Mlist -Mdclchk -Mdepchk -Mfptrap -Minfo -Mbounds -Mchkstk -Mchkptr -Mchkfpstk
Program crashes with an array out-of-bounds error. Adding the -Mchkstk -Mchkptr -Mchkfpstk options appears to cause the array bounds error, since it was not present with the first set of options.

OPTO= -O -Mlist
Program crashes with a memory fault.

Looking at the optimized failure: an integer value in memory is changed unexpectedly. The code between these two outputs is a small Fortran subroutine with no direct access to global variables, although the subroutine does call two routines written in C.

nizs1: 6
nizs2: 538976288

The problem could not be replicated when a smaller subset of code was extracted containing only the intervening code between these two statements.
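
One thing that may be worth noting about the corrupted value: 538976288 is 0x20202020 in hexadecimal, i.e. four bytes that each hold the ASCII code for a blank. The C sketch below just decodes the value byte by byte to show this. The function names are made up for illustration and are not taken from the model code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Count how many of the four bytes of a 32-bit integer hold the
   ASCII blank character (0x20). The result does not depend on
   byte order, since every byte is inspected. */
int count_blank_bytes(int32_t value)
{
    unsigned char bytes[sizeof value];
    int count = 0;

    memcpy(bytes, &value, sizeof value);   /* view the integer as raw bytes */
    for (size_t i = 0; i < sizeof value; ++i) {
        if (bytes[i] == 0x20)
            ++count;
    }
    return count;
}

/* Sanity check using the two values printed above. */
void check_nizs_values(void)
{
    assert(count_blank_bytes(538976288) == 4);  /* nizs2: every byte is a blank */
    assert(count_blank_bytes(6) == 0);          /* nizs1: an ordinary small integer */
}
```

If that reading is right, it hints that blank-padded character data (for example a Fortran CHARACTER variable, or a string buffer handled by one of the C routines) may be spilling into the integer. That is only a guess without stepping through the code.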

Adding the following statement at the end of the intervening Fortran routine:
write(0,*) 'ciseed:',ciseed

Results in the following:
nizs1: 6
ciseed: 725759
nizs2: 6

Addition of the write statement eliminated the memory corruption somehow. However, additional memory corruption problems surface later in code execution with the optimized 64 bit version of the executable - and I don’t trust any workaround where adding a write statement apparently solves the problem …

I can also eliminate the array bounds error in the non-optimized failure case by adding write statements to the code …

Any suggestions or ideas?

Thanks,

David

Hi David,

I think the array bounds error you get when adding “-Mchkstk -Mchkptr -Mchkfpstk” might be spurious. “-Mchkfpstk” is only used with the x87 floating-point stack, which an AMD64 doesn’t use in 64-bit mode. “-Mchkstk” would only pertain to OpenMP, which I don’t think you use. “-Mchkptr” checks to make sure that you’re not dereferencing a NULL pointer and should not affect your arrays.

The “-O2” (aka “-O”) error could be a compiler bug, since at this level of optimization the compiler tries to store variables in registers and hoist invariants out of loops. Adding a print statement will affect which optimizations the compiler is able to perform and might mask the problem.
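
For anyone unfamiliar with the term, hoisting an invariant out of a loop looks roughly like the hand-written C sketch below. It is only an illustration of the idea, not what the compiler actually emits for your code, and the function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* As written: scale * offset is recomputed on every iteration,
   even though neither operand changes inside the loop. */
double sum_naive(const double *x, size_t n, double scale, double offset)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += x[i] + scale * offset;
    return total;
}

/* Hand-hoisted: the invariant product is computed once and kept in a
   local, which an optimizer would typically hold in a register. */
double sum_hoisted(const double *x, size_t n, double scale, double offset)
{
    const double k = scale * offset;   /* loop invariant */
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += x[i] + k;
    return total;
}

/* Both forms must give the same answer for a correct program. */
void check_hoisting(void)
{
    const double x[] = {1.0, 2.0, 3.0};
    assert(sum_naive(x, 3, 2.0, 0.5) == sum_hoisted(x, 3, 2.0, 0.5));
}
```

A write statement in the middle of a routine can force values back to memory or prevent this kind of transformation, which is one way a latent bug (in the program or in the compiler) can appear or disappear.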

Another possibility is that you have a 64-bit porting issue, since the program runs correctly in 32-bit mode. Typically these errors occur when mixing C and Fortran code, since the C ‘long int’ and pointer data types are 8 bytes in 64-bit mode but 4 bytes in 32-bit mode.
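
As a rough illustration of that kind of mismatch, the C sketch below mimics what happens when a C routine stores a ‘long’ through an argument that the Fortran side declared as a 4-byte INTEGER. The routine names, the trailing-underscore naming, and the 4-byte default INTEGER are assumptions made for the example, not something taken from your code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical Fortran-callable routine: the argument arrives by
   reference and is assumed to be a 4-byte INTEGER on the Fortran side,
   so a fixed-width 32-bit type is used instead of long. */
void get_seed_(int32_t *seed)
{
    *seed = 725759;   /* writes exactly 4 bytes */
}

/* Simulate the buggy variant that stores a C long instead.
   slots[0] stands in for the Fortran INTEGER argument and slots[1]
   for whatever variable happens to sit next to it in memory.
   Returns 1 if the neighbour was overwritten, 0 otherwise. */
int long_store_clobbers_neighbour(void)
{
    int32_t slots[2] = {0, 6};
    long value = -1L;
    size_t nbytes = sizeof value < sizeof slots ? sizeof value : sizeof slots;

    memcpy(slots, &value, nbytes);   /* same bytes a store through long* would write */
    return slots[1] != 6;
}

/* The neighbour is damaged only where long is wider than 4 bytes. */
void check_long_store(void)
{
    assert(long_store_clobbers_neighbour() == (sizeof(long) > sizeof(int32_t)));
}
```

On a 32-bit build the store fits and nothing else is touched; on a 64-bit Linux build the extra 4 bytes land on the neighbouring variable, which would match a program that only misbehaves as a 64-bit application.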

Without seeing the code, however, it’s very difficult to diagnose. Can you send a report to trs@pgroup.com? We should be able to have an engineer look at the code to see if it’s a compiler bug or a program bug.

Thanks,
Mat