Memory usage in a Fortran code with OpenMP

Hi All,

I have two Fortran codes that use OpenMP to run parallel loops. Both codes call the same subroutines, which do most of the computation. Both codes appear to work correctly, in that the single-CPU versions give the same results as the parallel versions.

I noticed some odd behavior with the RAM. In one code, there are two parallel loops, with one run after the other, as in

!$omp parallel firstprivate(lots of variables) shared(other variables)
!$omp do
do i=1,Nloops
  ! lots of computations
enddo
!$omp end do
!$omp end parallel

compute statistics, set up next loop

!$omp parallel firstprivate(lots of variables) shared(other variables)
!$omp do
do i=1,Nloops
  ! more computations
enddo
!$omp end do
!$omp end parallel

When using a large number of threads, the code uses about 40% of the machine's RAM (according to the top command) while executing the first loop. When it enters the second loop, the RAM usage roughly doubles. I do not see this behavior when the code is compiled with gfortran.

Here are the compile commands:

pgf90 -O3 -Mextend -mcmodel=medium -mp -o fred_pgi fred.for

gfortran -fopenmp -mcmodel=medium -O2 -ffixed-line-length-132 -o fred_gfort fred.for

In my other code, there are five parallel loops, three of which have fewer iterations. In this case, the RAM usage increases by a bit over a factor of 4 as the program progresses. With gfortran, the RAM usage stays at a fixed value.

It seems that the PGI-compiled code does not give back the RAM used in a loop. In most cases the RAM usage does not exceed the machine's capacity, but I have had a case or two where large arrays were needed and the RAM limit was exceeded, crashing the system.

Is there a way to make the PGI-compiled codes stay at a fixed RAM usage, the way the gfortran-compiled codes do? I tried putting

!$omp flush

after the first parallel loop, but it made no difference.

Jerry

Can you give me some idea of the types of variables that are declared firstprivate? Local fixed-size arrays? Automatic arrays?

Is the code structured so that there are multiple distinct parallel regions, or is there a single parallel region with multiple worksharing directives inside it?

If you could send an example that demonstrates the behavior you describe, that would be useful.
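For reference, here is a minimal sketch of the two structures in question. The variable names (a, b, x) and loop bound (n) are placeholders, not names from Jerry's code:

```fortran
! Structure 1: two distinct parallel regions. Threads are
! forked and joined once per region, and each region creates
! its own set of firstprivate copies.
!$omp parallel firstprivate(a, b) shared(x)
!$omp do
do i = 1, n
  ! first set of computations
enddo
!$omp end do
!$omp end parallel

! ... serial work between the loops ...

!$omp parallel firstprivate(a, b) shared(x)
!$omp do
do i = 1, n
  ! second set of computations
enddo
!$omp end do
!$omp end parallel

! Structure 2: a single parallel region containing two
! worksharing constructs. The firstprivate copies are created
! once and reused for both loops; any serial work between the
! loops must go inside an "!$omp single" section.
!$omp parallel firstprivate(a, b) shared(x)
!$omp do
do i = 1, n
  ! first set of computations
enddo
!$omp end do
! the implicit barrier at "end do" keeps the loops ordered
!$omp do
do i = 1, n
  ! second set of computations
enddo
!$omp end do
!$omp end parallel
```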

There are dozens of variables in the private clauses, including integers and double-precision scalars as well as arrays.

I can try to get you some code later today. How should I deliver it?

Jerry

Hi Jerry,

Please send the code to PGI Customer Service (trs@pgroup.com) and ask them to forward it to Craig. If it's too big for email, they can give you instructions on how to FTP it to us.

  • Mat

Hi Mat and Craig,

Thanks for the response. I sent two separate tar files, one with the codes and one with the input files. Both codes illustrate the effect: after each parallel loop completes, the RAM usage jumps. One code implements a differential evolution Markov chain and has two parallel loops. The other implements a genetic optimization algorithm and has five parallel loops, so it shows more jumps in RAM usage.

I have tried the code on a Xeon system with 8 threads and on another Xeon system with 40 threads, and both systems show the same trend in RAM usage.

Jerry