OpenMP private and firstprivate memory usage (FORTRAN)

Hi All,

I have a large F77 legacy code that was recently made to run in parallel using OpenMP. There are a few hundred variables, both scalars and arrays, that are involved in the parallel loops. Since most of the variables need to initially have values that are set before the parallel loop, I put every non-shared variable in a “firstprivate” clause to start off the parallel loop. I got things to run correctly in parallel.

After the initial stage of getting the code to run, I have been making tweaks. One tweak involved moving some array variables into a “private” clause since those arrays are filled inside of the parallel loop. To my surprise, the RAM usage went down, in some cases by up to 40% depending on the number of parallel loops.

So here is my first question:

I have two codes, with the only difference being the distribution of some variables in firstprivate and private clauses.

!$omp  firstprivate(array1,array2,array3,scalar1,scalar2,scalar3...


!$omp firstprivate(scalar1,scalar2,scalar3, ...
!$omp&   private(array1,array2,array3, ...

When I run the codes using, say, 20 threads, both versions should create 20 private copies of array1, array2, array3, etc. Yet the second one uses less memory, in some cases significantly less. This behavior is seen with both the PGI compiler and the Intel compiler. What is going on here?
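For illustration, here is a minimal sketch (the variable names are made up) of what the two clauses ask the runtime to do. Both give every thread its own copy of the array; the difference is only whether that copy is initialized from the master's values on entry:

```fortran
      PROGRAM privdemo
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 1000000
      REAL*8 array1(N)
      INTEGER i
      array1 = 1.0D0
!$omp parallel firstprivate(array1)
!     each thread gets its own copy of array1, initialized by
!     copying the master thread's values at region entry
!$omp end parallel
!$omp parallel private(array1)
!     each thread gets its own copy of array1, but its contents
!     are undefined on entry -- no copy-in occurs, so the thread
!     must fill it before reading it
      DO i = 1, N
         array1(i) = 0.0D0
      END DO
!$omp end parallel
      END
```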

Second question:

These codes when compiled with the PGI compilers use about twice the RAM overall compared to codes compiled with gfortran or the Intel compiler. The codes have an initial parallel loop, followed by one or more parallel loops inside a larger loop:

 CALL init(variables, ...)
!$omp parallel default(none) firstprivate(variables, ...)
!$omp do
          DO i=1,Nloop
             CALL computevalues(variables, ...)
          END DO
!$omp end do
!$omp end parallel
           DO ibig=1,Nbig
!            proceed according to the algorithm in question, using
!            the current values of the variables
              CALL FIGURE_OUT_STEPS(variables, ...)
!$omp parallel default(none) firstprivate(variables, ...)
!$omp do
              DO i=1,Nloop
                 CALL computevalues(variables, ...)
              END DO
!$omp end do
!$omp end parallel
           END DO

During the initial loop, when using all of the available threads, the code will use (for example) 20% of the system RAM. However, using a PGI-compiled code, the RAM usage jumps up to about 40% once the Nbig loop is entered. The code compiled with either gfortran or the Intel compilers does not do this (that is, the RAM stays at 20%). Any ideas on what is going on in this case?



Hi Jerry,

My guess for both questions is where the data is being stored, heap or stack.

For Q1, apart from the copy-in needed to initialize a firstprivate array, the amount of memory used per thread should be the same for firstprivate and private. However, it may be that in the private case the arrays are being placed on the stack instead of the heap, making it appear that you're using less memory, since the heap is dynamically allocated while the stack is reserved up front.

For Q2, are you using automatics? If so, try adding the flag “-Mstack_arrays”. This will move automatic arrays from the heap to the stack. My guess is that PGI has these arrays on the heap while GNU and Intel are putting them on the stack.
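One caveat if you do try -Mstack_arrays: once large automatic arrays live on the stack, the main-thread and per-thread stack limits must be big enough to hold them, or the code will crash. A sketch of the relevant environment settings (the sizes here are illustrative, not recommendations):

```shell
# illustrative settings; size them to fit your largest automatic arrays
ulimit -s unlimited        # main-thread stack limit (bash/sh)
export OMP_STACKSIZE=512M  # per-thread stack for the OpenMP threads
```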


Hi Mat,

Thanks for the quick reply. For your response to the second question, I don’t know what you mean by “automatics”. I generally use these flags when compiling:

-fast -Mfprelaxed -Mipa=fast,inline -mcmodel=medium -mp


I don’t know what you mean by “automatics”.

By “automatics”, I mean Fortran automatic arrays, i.e. a local array in a subroutine whose size may vary from call to call:

subroutine foo (size)
   integer size
   real array(size)   ! automatic array
   ...
end subroutine foo

These arrays may be allocated on either the heap or the stack.

When you say “RAM usage”, I take that to mean the amount of memory used by the heap (as seen by top or a similar utility). Hence, where automatic arrays are allocated will affect what you're seeing and may account for the differences.
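For comparison, a short sketch (names are illustrative) showing the two kinds of local array side by side: the automatic array is the one whose placement -Mstack_arrays controls, while an ALLOCATABLE array is, in practice, always taken from the heap:

```fortran
      SUBROUTINE work(n)
      IMPLICIT NONE
      INTEGER n
      REAL*8 auto_arr(n)          ! automatic: heap or stack,
                                  ! depending on compiler and flags
      REAL*8, ALLOCATABLE :: heap_arr(:)
      ALLOCATE(heap_arr(n))       ! allocatable: heap
      auto_arr = 0.0D0
      heap_arr = 0.0D0
      DEALLOCATE(heap_arr)
      END
```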

Hi Mat,

Thanks for the clarification. I tried the -Mstack_arrays option. I saw a 1% decrease in RAM usage as reported by the “top” utility (22.61% with the flag versus 23.61% without). However, I saw a large decrease in overall performance: the test case built with -Mstack_arrays took 1390 seconds, while the same code built with all of the same flags except -Mstack_arrays took 980 seconds.

It appears that the -Mstack_arrays flag is not desirable for my particular case.