Problem with OpenMP, PGI 18.10

Hi All,

As I have noted before on these forums, I have a modeling code that handles a wide variety of situations. The “forward” model code has 285 subroutines. For convenience, these subroutines are in a single file attached to the main code with an INCLUDE statement. There are separate optimizing codes based on a variety of algorithms (genetic, MCMC, Nested Sampling, downhill simplex, etc.). These separate codes all include the 285 subroutines of the forward model code. In addition, they include another file with 103 subroutines related to the optimization process (reading data files, likelihood computation, etc.). Not all optimizing codes make use of all 103 subroutines. Some of the optimization algorithms lend themselves to parallel operation (for example the genetic code), and parallelization is accomplished using OpenMP directives.

These codes were originally in F77. I have ported them all to F90. The main differences are:

  1. I use free format for the F90 versions.

  2. The F90 versions make use of the INTENT(IN), INTENT(OUT), and INTENT(INOUT) declarations.

  3. The F90 versions make use of ALLOCATABLE arrays whose dimensions are computed on the fly as needed.
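To make the three changes concrete, here is a minimal sketch of what a ported routine looks like. The names (port_sketch, scale_grid, grid, total) are made up for illustration and do not come from the actual codes, which are far larger.

```fortran
! Sketch only: all names here are hypothetical, not from the real codes.
program port_sketch
  implicit none
  double precision, allocatable :: grid(:)   ! size decided at run time
  double precision :: total
  integer :: n

  n = 4                      ! in practice this would be computed on the fly
  allocate(grid(n))
  grid = 1.0d0

  call scale_grid(n, 2.0d0, grid, total)
  print *, 'total = ', total

  deallocate(grid)

contains

  subroutine scale_grid(npts, factor, values, total_out)
    integer, intent(in) :: npts                       ! read only
    double precision, intent(in) :: factor            ! read only
    double precision, intent(inout) :: values(npts)   ! read and modified
    double precision, intent(out) :: total_out        ! set by this routine

    values = values * factor
    total_out = sum(values)
  end subroutine scale_grid

end program port_sketch
```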

Here is the issue: All of the parallel codes in F77 compile using the -fast and -mp flags. However, only a few of the F90 versions compile using the -fast and -mp flags (all of the codes compile just fine without the -mp flag and using -fast and -Mipa=fast,inline). A typical compile using -fast takes two to three minutes, but with -fast and -mp the compile just hangs in most cases. I have tried using the -Minform=inform flag, but that is not helpful as the process hangs after a few lines of output.

One other possible clue: It is necessary to use the -mcmodel=medium flag to compile the F77 codes. That flag is not needed for the F90 versions.

From the looks of it, the compiler has problems when it encounters certain subroutines in the included optimization routines (I assume that if a subroutine is not called during the execution of the main program, it is ignored during the global optimization process). Is there a way to get more information out of the compiler to see where potential problems might be?

Finally, I note all codes compile using the -O1 and -mp flags. Using -O2 and -mp causes the compiler to hang for certain codes. gfortran works using the -O3 and -fopenmp flags in all cases.

Thanks,

Jerry

Hi Jerry,

This might be tough for you to diagnose since for compiler hangs we typically need to run the compiler through a debugger to see where it’s getting stuck. Are you able to put together a package that we could use to recreate the issue here?

My best guess is that it might be inlining. Is it getting stuck during the second pass of the IPA compilation? Does the code compile without -Mipa=fast,inline?

You can add the verbose (-v) flag which will show which tool is being executed at the time of the hang and might give some clues.

Also, watch the memory usage from top. With huge source files with lots of inlining, the compiler can use up a lot of memory. Chewing through this amount of memory can take a long time and can give the appearance of a hang. It’s possible that it’s progressing, just slowly.

“-time” will show the compile time for each of the compilation phases, but it is probably not too useful for a hang since the times only get printed upon completion.

“It is necessary to use the -mcmodel=medium flag to compile the F77 codes. That flag is not needed for the F90 versions.”

Are any of the allocatable arrays over 2GB? If so, you may want to add the flag “-Mlarge_arrays” or keep “-mcmodel=medium” (which implies -Mlarge_arrays). Probably not the root cause of the hang, but worth a try.
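If you are unsure, a quick check along these lines reports the size of an allocated array in bytes. This is only a sketch: the array name and its dimensions are placeholders, and it assumes your compiler accepts the KIND argument to SIZE and STORAGE_SIZE (Fortran 2003/2008 features).

```fortran
! Sketch only: "work" and its dimensions are placeholders.
program array_size_sketch
  use iso_fortran_env, only: int64
  implicit none
  double precision, allocatable :: work(:,:)
  integer(int64) :: nbytes
  integer(int64), parameter :: two_gb = 2_int64**31

  allocate(work(1000, 1000))

  ! element count times bytes per element, computed in 64-bit integers
  nbytes = size(work, kind=int64) * (storage_size(work, kind=int64) / 8_int64)

  print *, 'bytes used by work: ', nbytes
  print *, 'larger than 2GB?    ', nbytes > two_gb

  deallocate(work)
end program array_size_sketch
```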

-Mat

Hi Mat,

The total memory footprint when running on a single core is much less than 2GB.

The compiler hangs when using the -O2 flag only. The compiler still hangs with the -mcmodel=medium flag. Here is the output when I use the -v flag (along with -O2 and -mp):

Export PGI_CURR_CUDA_HOME=/opt/pgi/linux86-64/2018/cuda/9.1
Export PGI=/opt/pgi

/opt/pgi/linux86-64/18.10/bin/pgf901 hammerELC.f90 -opt 2 -nohpf -nostatic -x 19 0x400000 -quad -x 59 4 -x 15 2 -x 49 0x400004 -x 51 0x20 -x 57 0x4c -x 58 0x10000 -x 124 0x1000 -x 129 2 -tp skylake -x 57 0xfb0000 -x 58 0x78031040 -x 70 0x6c00 -x 47 0x400000 -x 47 0x08 -x 48 4608 -x 49 0x100 -x 120 0x200 -stdinc /opt/pgi/linux86-64/18.10/include-gcc48:/opt/pgi/linux86-64/18.10/include:/usr/lib64/gcc/x86_64-suse-linux/4.8/include:/usr/local/include:/usr/lib64/gcc/x86_64-suse-linux/4.8/include-fixed:/usr/lib64/gcc/x86_64-suse-linux/4.8/…/…/…/…/x86_64-suse-linux/include:/usr/include -cmdline ‘+pgfortran hammerELC.f90 -v -O2 -Mvect=sse -Mcache_align -Mpre -mp -o hammerELC_par’ -def unix -def __unix -def unix -def linux -def __linux -def linux -def __NO_MATH_INLINES -def LP64 -def __x86_64 -def x86_64 -def LONG_MAX=9223372036854775807L -def ‘SIZE_TYPE=unsigned long int’ -def ‘PTRDIFF_TYPE=long int’ -def extension= -def amd_64__amd64 -def __k8 -def k8 -def SSE -def MMX -def SSE2 -def SSE3 -def SSSE3 -freeform -vect 48 -x 54 1 -x 70 0x40000000 -y 163 0xc0000000 -x 189 0x10 -x 53 2 -quad -x 119 0x10000000 -mp -x 69 0x200 -x 69 0x400 -modexport /tmp/pgfortranYG4hwnMbEnlz.cmod -modindex /tmp/pgfortrancG4hg4xpjn_S.cmdx -output /tmp/pgfortransG4h25nEzxq7.ilm
PGF90-I-0035-Predefined intrinsic iidint loses intrinsic property (hammerELC.f90: 487)
PGF90-I-0035-Predefined intrinsic gamma loses intrinsic property (hammerELC.f90: 487)
PGF90-I-0035-Predefined intrinsic imag loses intrinsic property (hammerELC.f90: 487)

As you can see, the compiler got stuck relatively quickly. Several hundred lines are printed out when I compile a code that works.

I can put together a tar file with an example of a code that hangs the compiler, and one that does not. Where can I send it?

Jerry

Hi Jerry,

I just sent you a link to an SFTP site that you can use to upload the file.

Thanks,
Mat

Hi Mat,

Thanks for the link. I uploaded some codes that compile and some that don’t.

I forgot to mention that I use a shell script to compile all of the various codes. I started the script in the evening, and I discovered that it was still going in the morning. Thus, the compiler ran for several hours without progressing.

I also looked at the “top” output. On the code where it hangs, the pgf901 task is using 0.073% of 64 GB, which does not seem that high. I then looked at the top output for a case where the compiler worked. It started with the pgf901 task, which used up to 0.2%. It then switched over to the pgf902 task, which used around 0.1%. Overall, the compiler took 4 minutes and 37 seconds on a Xeon W-2155 CPU @ 3.30GHz.

Jerry

Hi Jerry.

Ok, so there’s some good and bad news here.

First, I am able to recreate the hang and it looks to be a problem with intra-procedural pointer analysis due to the very large number of subroutine arguments. You can disable this optimization via the flag “-Hy,53,2” and the code should compile to completion.

However, when using PGI 19.1 and later, the front-end compiler crashes with signal 6 (bus error). I’ve traced this problem to the use of more than 50 variables in a firstprivate clause (you use several hundred) and I have reported the error as TPR#27838.

As a quick workaround, I changed “firstprivate” to “private” in the two files that hang. After this, I still see the hang with 19.1, but the compile succeeds with 19.4 (with and without -Hy,53,2). Hence I think the hang has already been fixed, so I did not report it. Of course, we’ll need to retest once the problem with “firstprivate” has been fixed.

So your current options:

  1. Continue to use 18.10 using “-Hy,53,2” as a workaround.
  2. Move to 19.10, but limit the code to use 50 or fewer firstprivate variables per parallel region (this may be a pain though, sorry).

Thanks for the report,
Mat

Hi Mat,

Thanks for the help. I was able to use the -fast -mp flags along with the -Hy,53,2 flag (PGI 18.10). As for the excessive number of variables in FIRSTPRIVATE clauses, those variables need to be FIRSTPRIVATE rather than PRIVATE. I suppose I could have a double precision large array that stores individual variables, and have each subroutine grab the individual parameters as needed, but that would be a huge pain.
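For what it is worth, a variation on that idea would be to group the scalars into a derived type so that the FIRSTPRIVATE clause names a single variable. The sketch below is only an illustration: the type model_params and its components are invented names, and I have not tested whether this avoids the limit described in TPR#27838.

```fortran
! Sketch only: model_params and its components are hypothetical names.
module params_mod
  implicit none
  type :: model_params
    double precision :: period = 0.0d0
    double precision :: inclination = 0.0d0
    double precision :: mass_ratio = 0.0d0
  end type model_params
end module params_mod

program firstprivate_sketch
  use params_mod
  implicit none
  type(model_params) :: p
  double precision :: results(8)
  integer :: i

  p%period = 2.5d0
  p%inclination = 80.0d0
  p%mass_ratio = 0.5d0
  results = 0.0d0

  ! One entry in the clause instead of one per scalar. Each thread
  ! starts with its own copy of p, initialized from the values above.
  !$omp parallel do firstprivate(p)
  do i = 1, size(results)
    p%inclination = dble(i)   ! overwrites the thread's private copy only
    results(i) = p%period * p%mass_ratio + p%inclination
  end do
  !$omp end parallel do

  print *, results
end program firstprivate_sketch
```

The subroutines would still have to be changed to take the derived type (or its components) as arguments, so this does not remove the pain, it just moves it.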

A few loose ends, which may not be important:

  1. The F77 versions of the code all compile with -fast -mp -mcmodel=medium flags (PGI 18.10). I guess the compiler takes a different route depending on whether an array has a specific dimension or is ALLOCATABLE.

  2. I found a working version of PGI 16.5 on a machine I hardly use anymore. That version could compile the F90 codes using the -fast -mp flags.

Jerry