Loop unrolling (PGI 5.1 and 5.2: pgf77)

Hi forum,

One of our users sees subtle differences in the results of his code (a “bulky”
density functional solver) when loop unrolling is activated during compilation.
Some initial results differ by 1.0e-5, leading to sizeable differences in the final
(electron density) results. The code contains about 700 loops, so there is little
chance of pinpointing the one(s) causing the trouble. We haven’t tried 6.0 yet,
but my understanding of loop unrolling is that neither the order of the statements
inside the loop nor the order in which they are executed is changed. I would
therefore expect differences to stay within double-precision rounding error;
however, the observed deviations are far bigger.
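
To make the expectation about statement order concrete, here is a minimal sketch (not taken from the user's code; the program name, array contents and loop length are made up for illustration). It sums the same array with a plain loop, with a loop unrolled by four that keeps the original order of additions, and with an unrolled loop that uses four separate partial sums. Only the last variant reorders the floating-point additions. Whether a given compiler option ever produces that last form is not something this sketch establishes.

```fortran
program unroll_order
  implicit none
  integer, parameter :: n = 1000          ! hypothetical size, divisible by 4
  double precision :: x(n), s_rolled, s_inorder, s_split
  double precision :: p1, p2, p3, p4
  integer :: i

  ! Hypothetical test data with widely varying magnitudes.
  do i = 1, n
     x(i) = 1.0d0 / dble(i)**3
  end do

  ! Reference: plain loop.
  s_rolled = 0.0d0
  do i = 1, n
     s_rolled = s_rolled + x(i)
  end do

  ! Unrolled by 4, same order of additions as the plain loop.
  s_inorder = 0.0d0
  do i = 1, n, 4
     s_inorder = s_inorder + x(i)
     s_inorder = s_inorder + x(i+1)
     s_inorder = s_inorder + x(i+2)
     s_inorder = s_inorder + x(i+3)
  end do

  ! Unrolled by 4 with separate partial sums: the additions are
  ! reassociated, so rounding may differ from the plain loop.
  p1 = 0.0d0
  p2 = 0.0d0
  p3 = 0.0d0
  p4 = 0.0d0
  do i = 1, n, 4
     p1 = p1 + x(i)
     p2 = p2 + x(i+1)
     p3 = p3 + x(i+2)
     p4 = p4 + x(i+3)
  end do
  s_split = (p1 + p2) + (p3 + p4)

  print *, 'rolled - in-order unrolled :', s_rolled - s_inorder
  print *, 'rolled - split accumulators:', s_rolled - s_split
end program unroll_order
```

With strict double-precision evaluation the first difference should be exactly zero, because the same additions happen in the same order. The second difference, if nonzero, should be at rounding level, far below 1.0e-5.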

Has anyone noticed similar effects with loop unrolling?

Many thanks,
Michael

Hi Michael,

What type of system is this being run on, and what flags are being used with each run? If you’re on a 32-bit system, this sounds more like an x87 precision issue than an unrolling issue (see the PGI FAQ).

  • Mat

Hi Mat,

The system is an AMD Opteron and the compiler flags are -fast or all options
in -fast except loop unrolling. I checked this with one of my Monte-Carlo codes and
5.2 but didn’t see any difference, even at higher precision than mentioned above
(6 to 8 significant digits). I checked it again with all optimizations turned off and
found no difference.

I also checked 5.2 against 6.0 with and without loop unrolling on a 32-bit Xeon and
again did not find any difference. The last thing to do is a check of his code with
6.0 on an AMD Opteron.

By the way, my tests always started the Monte-Carlo run from the same random number
seed, so the sequence of operations predetermines the results digit by digit.
If optimization upsets something in the MC code, some results usually change quite
drastically in the course of the simulation because some Monte-Carlo updates go
another way.

Best regards,
Michael

Hi Michael,

“-Munroll” shouldn’t have any effect on the order of operations. It can cause values to be kept in registers longer, but this would only affect precision when using the x87 FPU.

When you say “all options in -fast except loop unrolling”, what exact flags are being used? I ask because the exact meaning of “-fast” can change, and if the user is looking at an older manual then he/she might have missed a flag. (Note: to find the most up-to-date meaning of “-fast”, execute “pgf90 -help -fast” from the command line.) Specifically, I’m wondering if “-Mlre” was included. “-Mlre” performs loop-carried redundancy elimination and can have an impact on the loop’s operations. If I’m correct, have the user compile with “-fast -Mnolre” to turn off LRE.

  • Mat

Hi Mat,

Sorry for the long delay; I was at a workshop, and in the meantime tests with 6.0
have been made on our Opterons. The results may be a relief for PGI, because
the program now crashes depending on the optimizations used. As I could not
reproduce any of the effects of loop unrolling with my own test codes, this indicates
that a bug in the program causes the problems. In fact, in an old part of the Fortran
code there is a subroutine to which all arguments are passed by reference to an
array that holds arguments of various types via ‘equivalence’ statements. The
cause of the trouble may be the optimization-dependent placement of variables in
memory, about which the subroutine most likely makes invalid assumptions.

All I could do here was wish the (frustrated) user good luck in debugging the
‘dirty’ Fortran code of the subroutine.

Best regards,
Michael

Hi Mat,

Today I learned that on Pentiums the aforementioned user program
works fine (5.2 and 6.0). So the trouble is restricted to Opteron platforms. Do you
have an explanation for this?

Best regards,
Michael

Have the user compile and run the program with the following options. Let me know which get the expected answer.

On the P4,

  1. -fast -Kieee -pc 64 ← don’t accumulate values in 80-bit x87 FPU registers
  2. -fast -Mscalarsse ← Use SSE instead of x87
    If either of these fails, it’s an x87 vs. SSE precision issue.

On the Opteron:

  1. -fast -tp k8-32 ← Should be the same as -fast -Mscalarsse on the P4
  2. -O0 ← If it fails here then it’s a 64-bit porting issue
  3. -O2
  4. -O2 -Munroll=c:1
  5. -O2 -Munroll=c:1 -Mlre
  6. -fast

  • Mat

Hi Mat,

Here are the results:

Pentium 4:

  1. -fast -Kieee -pc 64
  2. -fast -Mscalarsse
  3. -O0
    OK in all cases (correct result).

Opteron:

  1. -fast -tp k8-32
    could not be tested due to some numerical libraries, which are only available in 64 bit
  2. -O0
    pgi 5.2: the same as on P4
    pgi 6.0: segmentation fault
  3. -O2
  4. -O2 -Munroll=c:1
  5. -O2 -Munroll=c:1 -Mlre
    pgi 5.2 & pgi 6.0: “wrong” result, but consistent among one another
  6. -fast
    pgi 5.2 & pgi 6.0: the same as on P4

The last result surprises me. However, the tests were made for an atypically small
system (i.e., some arrays may not run out of their assumed bounds). I hope that
the results 2)-5) on the Opteron tell you something.

Many thanks and best regards,
Michael

Hi Michael,

Although I’m just speculating, it seems the program might be reading undefined memory. I’ve seen programs with an ‘off by one’ error where the program was reading off the end of an array. Most of the time the program was lucky and the memory it was accessing was a valid location. However, changes in the compiler flags, compiler version, and/or platform can affect how the memory is laid out and result in unexpected behavior.
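
As an illustration of that kind of error, here is a minimal sketch (the names `storage` and `fill` are hypothetical and not from the user's program). To keep the demonstration well defined, the "neighboring" value is placed deliberately as the third element of one array, and the extent passed to the subroutine is too large by one. In real code the neighbor would be whatever the compiler happened to place next in memory, which is why the symptoms change with flags, compiler version, and platform.

```fortran
program clobber_demo
  implicit none
  ! Hypothetical layout: elements 1-2 are the work array; element 3
  ! stands in for an unrelated variable that happens to sit next to it.
  double precision :: storage(3)

  storage = 0.0d0
  storage(3) = 42.0d0

  call fill(storage, 2)      ! correct extent: the neighbor is untouched
  print *, 'after correct call, neighbor =', storage(3)

  call fill(storage, 3)      ! extent too large by one: neighbor overwritten
  print *, 'after wrong call,   neighbor =', storage(3)

contains

  ! Writes -1 into the first n elements of t; it trusts n completely.
  subroutine fill(t, n)
    integer, intent(in) :: n
    double precision, intent(inout) :: t(n)
    integer :: i
    do i = 1, n
       t(i) = -1.0d0
    end do
  end subroutine fill

end program clobber_demo
```

The first call leaves the neighbor at 42; the second silently replaces it with -1. Nothing crashes here, which is the "lucky" case: the damage only shows up later, when the clobbered value is used.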

The next step is to determine why the program seg faults using “-O0” and the 6.0 PGI compilers in 64-bit. First add “-Mbounds” to see if you’re reading off the end of an array. Next compile and run with “-g”. If the program still seg faults, use pgdbg or gdb to isolate where and why it’s seg faulting. “-g” can change the memory layout, so the program might succeed. If this occurs, run the “-O0” compiled program through pgdbg. You won’t have the debug symbol information, but at least you can get the file name and line number where the seg fault occurs.

  • Mat

Dear Mat,

We have identified the source of the error (the program works with -g). The
cause is the value of an integer index, which is overwritten with a meaningless
value on the second call to a subroutine. The subroutine apparently
accesses memory beyond its allowed range when called a second time!
The integer index was passed to the subroutine as an additional parameter
to allow its value to be printed. The basic code structure is:

subroutine amix(…, t, n, …, index)
  integer n, index
  double precision t(n), temp1, temp2

  do i = 1, n            ! n = 1 here

    temp1 = …
    temp2 = …

    print *, index       ! value OK
    t(i) = temp2 - temp1
    print *, index       ! meaningless value on second call

  end do

end subroutine amix

The first call to amix leaves ‘index’ unchanged, the second call changes ‘index’ to
something meaningless (e.g. a large negative integer) which causes the program
to seg fault when ‘index’ is used to address another array. Apart from the ‘print’
statements the subroutine does not access ‘index’ and therefore index is normally
omitted from the parameter list. Nonetheless ‘index’ is corrupted after the second
call to the subroutine.

To me, this seems to point to a compiler problem on 64-bit Opteron systems.

Best regards,
Michael

Dear Mat,

I just learned that the compiler option -Msave makes the program run and yield
correct results at any optimization level. It seems as if there is a ‘memory leak’
caused by one of the subroutines in the program when it is compiled for 64-bit Opteron
machines.
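
For context, and as an assumption rather than a diagnosis: as far as I know, -Msave makes the compiler treat all local variables as if they carried the SAVE attribute, i.e. gives them static storage. That changes where locals live in memory and lets them keep their values between calls, so it can hide code that silently relies on either. The sketch below (hypothetical names, not from the user's program) shows the portable way to state such a dependence explicitly.

```fortran
program save_demo
  implicit none
  integer :: k, last

  last = 0
  do k = 1, 3
     call count_calls(last)
  end do
  print *, 'calls counted:', last

contains

  subroutine count_calls(total)
    integer, intent(out) :: total
    ! Explicit SAVE with initialization: the value persists between
    ! calls regardless of compiler flags. Without SAVE, the Fortran
    ! standard leaves a local's value undefined after the routine returns.
    integer, save :: ncalls = 0

    ncalls = ncalls + 1
    total = ncalls
  end subroutine count_calls

end program save_demo
```

If legacy code only works with -Msave, it may be worth checking for locals that are read before being assigned on a later call, in addition to out-of-range accesses.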

Best regards,
Michael

Hi Michael,

I’m glad you were able to get things to work correctly. Although I’m not entirely convinced that it’s a compiler bug, we’re happy to look at it here to see if we can determine the root cause of the error. Please send a note to trs@pgroup.com and they’ll give you instructions on how you can upload the code. Just give a brief explanation and let them know that you have been talking with me.

Thanks,
Mat