I have a large F77 legacy code that was recently parallelized using OpenMP. There are a few hundred variables, both scalars and arrays, involved in the parallel loops. Since most of the variables need initial values that are set before the parallel loop, I put every non-shared variable in a “firstprivate” clause at the start of the parallel loop, and I got things to run correctly in parallel.
After the initial stage of getting the code to run, I have been making tweaks. One tweak was moving some array variables into a “private” clause, since those arrays are filled inside the parallel loop. To my surprise, the RAM usage went down, in some cases by up to 40% depending on the number of parallel loops.
So here is my first question:
I have two codes whose only difference is how some variables are distributed between the firstprivate and private clauses:
!$omp firstprivate(scalar1,scalar2,scalar3, ...
!$omp& private(array1,array2,array3, ...
When I run the codes using, say, 20 threads, both codes should hold 20 copies of array1, array2, array3, and so on. Yet the second one uses less memory, in some cases significantly less. This behavior is seen with both the PGI compiler and the Intel compiler. What is going on here?
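To make the comparison concrete, here is a minimal sketch of the two variants (variable names `work`, `result`, and the sizes are made up for illustration). In both cases every thread gets its own copy of the array; the only semantic difference is that firstprivate copies the original contents into each thread's copy at region entry, while private leaves each copy undefined, which is safe here because the array is fully overwritten in every iteration before it is read:

```fortran
      PROGRAM demo
      IMPLICIT NONE
      INTEGER M, Nloop
      PARAMETER (M=100, Nloop=1000)
      INTEGER i, j
      DOUBLE PRECISION scalar1, work(M), result(Nloop)

      scalar1 = 2.0D0
      DO j=1,M
         work(j) = 0.0D0
      END DO

! Variant 1: firstprivate(work) -- each thread's copy of work is
! initialized from the original at region entry (a per-thread copy-in).
!$omp parallel default(none) firstprivate(scalar1,work)
!$omp& private(j) shared(result)
!$omp do
      DO i=1,Nloop
         DO j=1,M
            work(j) = scalar1*DBLE(i+j)
         END DO
         result(i) = work(M)
      END DO
!$omp end do
!$omp end parallel

! Variant 2: private(work) -- each thread's copy starts undefined.
! That is fine here because work is used purely as scratch space:
! it is completely filled inside each iteration before being read,
! so no copy-in of the original contents is needed.
!$omp parallel default(none) firstprivate(scalar1)
!$omp& private(work,j) shared(result)
!$omp do
      DO i=1,Nloop
         DO j=1,M
            work(j) = scalar1*DBLE(i+j)
         END DO
         result(i) = work(M)
      END DO
!$omp end do
!$omp end parallel

      PRINT *, result(Nloop)
      END
```

Both variants produce identical results; the question is why the private variant's per-thread array copies appear to cost less memory than the firstprivate ones.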
Here is my second question. These codes, when compiled with the PGI compilers, use about twice the RAM overall compared to the same codes compiled with gfortran or the Intel compiler. The codes have an initial parallel loop, followed by one or more parallel loops inside a larger loop:
      CALL init(variables, ...)
!$omp parallel default(none) firstprivate(variables, ...)
!$omp do
      DO i=1,Nloop
         CALL computevalues(variables, ...)
      END DO
!$omp end do
!$omp end parallel
!
      DO ibig=1,Nbig
!
!        proceed according to the algorithm in question, using
!        the current values of the variables
!
         CALL FIGURE_OUT_STEPS(variables, ...)
!
!$omp parallel default(none) firstprivate(variables, ...)
!$omp do
         DO i=1,Nloop
            CALL computevalues(variables, ...)
         END DO
!$omp end do
!$omp end parallel
      END DO
During the initial loop, when using all of the available threads, the code will use (for example) 20% of the system RAM. However, with a PGI-compiled code, the RAM usage jumps to about 40% once the Nbig loop is entered. The code compiled with either gfortran or the Intel compiler does not do this (that is, the RAM stays at 20%). Any ideas on what is going on in this case?