RAM usage with OpenMP

Hi All,

I noticed this about a large code I recently altered to run in parallel using OpenMP. When I compile the code for a single core, it uses about 0.3% of the system memory. When I compile with -mp and run on 5 cores, it uses 4.6% of the memory. On 42 cores, it uses 35.9%. The percentage of RAM used seems to scale more or less with the number of threads. However, I would expect the 5-core job to take at most 5 times the RAM of the single-core job (about 1.5%), not the roughly 15 times I am seeing.

Is this behavior expected? Is there some other compiler switch I am overlooking?

I used

pgf90 -Mextend -O3 -mp -mcmodel=medium -tp barcelona -o fred fred.for



Hi Jerry,

I have been checking on this, and I believe I understand what you are seeing.

When you compile an application with -mp, a certain amount of additional runtime overhead gets linked into the application to support OpenMP. This overhead is most noticeable when the memory usage of the rest of your application is relatively small. Based on your observations, I believe this accounts for the jump in memory usage from the single-core build (compiled without -mp) to the -mp build running 5 threads.

When you jump up to 42 threads, this initial overhead is amortized over all the threads in your application. Since your application appears to scale up memory usage with number of threads, the initial impact of this overhead is probably less noticeable at 42 threads than it would be at 5 threads.

Unfortunately, there isn't really a good way to separate the memory used by the OpenMP implementation from the rest of the runtime (or your application). The best advice I can give for measuring this overhead is to compile the application with -mp, run it with OMP_NUM_THREADS=1, and compare that against a run of the build compiled without -mp.

Hope this helps.

Best regards,


Hi Chris,

Thanks for the reply. I thought the overhead seemed a little steep. I have a lot of variables in the parallel loop, so to get started I put them all in "firstprivate" clauses. Since then, I have gone back and moved a few of the data arrays to "shared" clauses, which has helped the memory usage a bit. However, most of the variables really do need to be private, so it seems the memory usage will always scale like 2.5 * OMP_NUM_THREADS rather than 1.0 * OMP_NUM_THREADS.

This is only an issue on some of our older Intel boxes. For example, we have a 12-core box, and at the moment I can only get 10 threads going at once when using a small data set. It won't run at all (except in serial mode) on the largest data sets.