-Mbounds, -O2 and debugging

Hi,

I am getting unexpected results (NaN) when running a program, but they go away when I compile with -Mbounds. Removing -O2 also works, which I find strange as I thought that -O2 was the default optimization level.
No warning or error is issued about array bounds being overrun, so I would like some ideas on how to debug this. The code is mostly f90 with a C function thrown in. Linux 64 (RHE 3), Opteron.

Fails with:
pgcc -c -O2 -B *.c
pgf90 -c -O2 *.f90
pgf90 -o a.x *.o

Runs as expected with the following options:
-c -O2 -Mbounds
-c -Mbounds
-c

Problem2
I was having a runtime error executing the following read assignment:

real :: pa(k2,k3) ! k3 = 1

read (nsurunit,'(8E12.6)') pa(1:k1,:)

until it was changed to:

read (nsurunit,'(8E12.6)') pa(1:k1,1)

The original line seems perfectly acceptable. Any ideas why pgf90 was getting confused?

Cheers,
Tiago

Hi Tiago,

For problem 1, try compiling with “-O2 -Ktrap=fp”. It might shed some light on where the NaNs are being generated. It is odd that it works with -Mbounds, but that’s more likely luck than anything else. Also try compiling with “-O2 -gopt” and running it through the debugger.

For problem two, the code looks OK, but without the full context and the actual error message, it’s very hard to tell. Can you please send a test case to PGI Customer Support at trs@pgroup.com? Please also include which compiler version you’re using and which OS you have.

Note that the default optimization level is -O1, not -O2.

  • Mat

Hi,

Thanks for the suggestions. As you say, the errors are not always consistent, so it must have been down to luck.
-Ktrap=fp doesn’t issue any messages either, and I got a segmentation fault.
-O2 -g (-gopt ?) results in NaN, and the Fortran code STOPS after it runs into NaNs. Can I set an event in pgdbg if any variable takes the value NaN? The little MATLAB debugger does that.

Regarding the problem with array assignment, I was getting the following error:

PGFIO-F-252/formatted read/unit=35/operation attempted after end of file.

It is perhaps worth mentioning that this code has been used on many different platforms, compilers and OSs. But it had never been tested on my combination: AMD, Linux64 and PGI.

Thanks for any input,
t.

Hi T,

Can I set an event in pgdbg if any variable takes the value NaN?

No, but I’ll send a feature request to our Tools Group.

I’d like to try narrowing down the scope of the problem. Let’s see if it’s a porting problem by compiling in 32 bits, “-O2 -tp k8-32”, or a precision issue by compiling with “-O2 -Kieee”. Also, I’d like you to run your program through Valgrind to see if you have any memory problems such as uninitialized memory reads (UMR).
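
A minimal sketch of how these experiments might be scripted, assuming a POSIX shell and that all Fortran sources can be compiled and linked in a single command. The helper names (build_variant, memcheck_variant), the FC variable and the a_<label>.x naming are made up for illustration; the C object is left out for brevity, and --track-origins assumes a reasonably recent Valgrind.

```shell
#!/bin/sh
# Sketch only: FC, the helper names and the a_<label>.x naming are assumptions.
FC=${FC:-pgf90}   # Fortran driver used in this thread

# build_variant LABEL FLAGS...
# Compiles and links all Fortran sources with FLAGS into a_LABEL.x,
# so that each flag set gets its own executable.
build_variant() {
    label=$1
    shift
    "$FC" "$@" -o "a_$label.x" *.f90
}

# memcheck_variant LABEL
# Runs a_LABEL.x under Valgrind's default memcheck tool.
# --track-origins=yes asks Valgrind where an uninitialised value came from.
memcheck_variant() {
    valgrind --track-origins=yes "./a_$1.x"
}
```

Typical use would be "build_variant k8-32 -O2 -tp k8-32" and "build_variant ieee -O2 -Kieee", followed by "memcheck_variant ieee". Keeping one executable per flag set makes it easy to rerun and compare them later.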

As for problem 2, I’m not sure why you’re reading past the end of the file, and I will most likely need to see the full code to understand what’s going on. Can you please send an example to trs@pgroup.com? I’ll be on a business trip the rest of the week, but someone in Customer Service should be able to help.

  • Mat

I only just had the time to do some of the suggested tests. No improvement using -Kieee, but “-O2 -tp k8-32” seems to work well. It also works with -O3 but yields NaN if I use -fastsse :-(

Ok, so it seems to be a problem while generating 64 bit code. Does the suspicion of uninitialized memory reads still apply? Does anyone have other suggestions?

I still want to get it to work in 64 bit to make use of SSE2 goodies such as vectorization.

Thanks

Hi Tiago,

It could be a UMR or a precision issue, but I’m not entirely sure. One thing you could do is compile all of your source files at “-O2”, then recompile them one at a time at “-O0” and relink until the NaNs go away. This will help narrow down which source file is causing the NaNs.

Next, separate each of the routines in this file into their own file. Repeat compiling at “-O2” and “-O0” until you’re able to determine which routine is causing the problem.

Now recompile all files with “-g” and then create a second executable using “-O2 -gopt”. Run the two executables side by side, each in its own PGDBG session, breaking at the problem subroutine. Step through each line, comparing the values of the variables until they diverge.
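
The file-level step of this search could be automated roughly as follows. This is only a sketch under several assumptions: each round rebuilds with exactly one file at -O0 (one reading of the procedure), NaNs show up as the text "NaN" in the program output, and the C object has already been built so that *.o picks it up at link time. FC, PROG and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: find which source file must be built at -O0 for the NaNs to vanish.
# FC, PROG and the NaN check are assumptions; adapt them to the real build.
FC=${FC:-pgf90}        # compiler driver (the thread uses pgf90)
PROG=${PROG:-./a.x}    # executable name from the original link line

# Succeeds when the program output contains "NaN" (hypothetical check).
has_nan() {
    "$PROG" 2>&1 | grep -qi 'nan'
}

# Compile every Fortran file at -O2 except $1, which is compiled at -O0,
# then relink all objects in the current directory.
build_with_one_unoptimized() {
    for src in *.f90; do
        if [ "$src" = "$1" ]; then opt=-O0; else opt=-O2; fi
        "$FC" -c "$opt" "$src" || return 1
    done
    "$FC" -o "$PROG" *.o
}

# Try each source file in turn as the single unoptimized file.
bisect_files() {
    for suspect in *.f90; do
        build_with_one_unoptimized "$suspect" || return 1
        if ! has_nan; then
            echo "NaNs disappear when $suspect is built at -O0"
        fi
    done
}
```

Calling "bisect_files" in the source directory prints one line per file whose -O0 build removes the NaNs. A run that crashes before printing any NaN is also reported as clean, so the result should be checked by hand.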

  • Mat

Hi again,

Still getting a seg fault, but Valgrind has detected some uninitialised values that take me back to the problem with the read command that I mentioned in the first post. Once more, it seems to me that the code is good. Valgrind says (abridged):

Conditional jump or move depends on uninitialised value(s)
at 0x4F089F: fr_readnum (in /(…)/most_plasim_t21_l10_p1.x)
by 0x4F02FE: fr_read (in /(…)/most_plasim_t21_l10_p1.x)
by 0x4EFD6B: _f90io_fmt_read (in /(…)/most_plasim_t21_l10_p1.x)
by 0x503788: hpfio_read (in /(…)/most_plasim_t21_l10_p1.x)
by 0x5035B9: hpfio_loop (in /(…)/most_plasim_t21_l10_p1.x)
by 0x503909: hpfio_main (in /(…)/most_plasim_t21_l10_p1.x)
by 0x4EE0AA: pghpfio_fmt_read (in /(…)/most_plasim_t21_l10_p1.x)
by 0x49BEAF: surface_ini (surfmod.f90:307)
by 0x402249: prolog (plasim.f90:92)
by 0x401CBA: MAIN (plasim.f90:56)
by 0x401C4D: main (in /(…)/most_plasim_t21_l10_p1.x)
--25634-- REDIR: 0x3CD886FBE0 (strnlen) redirected to 0x4906C80 (strnlen)
==25634==
Process terminating with default action of signal 11 (SIGSEGV): dumping core
Bad permissions for mapped region at address 0x95ADFC
at 0x95ADFC: ???
by 0x40245D: prolog (plasim.f90:116)
by 0x401CBA: MAIN (plasim.f90:56)
by 0x401C4D: main (in /(…)/most_plasim_t21_l10_p1.x)

In the code (surfmod.f90:307) this corresponds to:
integer :: ih(:)
(…)
read (nsurunit,'(8I10)',IOSTAT=iostat) ih(:)

I have tried to play it safe with
integer :: ih(1:8) = 0
(…)
read (nsurunit,'(8I10)',IOSTAT=iostat) ih(1:8)
but nothing changes.

The ASCII file that is being read also looks fine. Could this be a bug with pgf90 (6.0) and if so is there a workaround?

I wasn’t able to try your suggestion of compiling different objects with different optimization levels because it always crashes no matter the compiler flags.

Thanks,
tiago

Hi Tiago,

The binary search method I mentioned above should be used to help find the NaNs (aka Problem #1). As I mentioned before, for your seg fault (aka Problem #2) we need a way to reproduce the error here in order to determine if it is indeed a compiler issue. Please send a report to PGI Customer Service at trs@pgroup.com with example code.

Thanks,
Mat