OpenMP - high system usage

I have a Fortran 90 OpenMP code that I run on a Gentoo Linux quad-Opteron machine. The code has parallel and sequential parts which are executed in a loop sequence.

I experience the following: in the parallel part, all 4 CPUs show 100% user load, as expected. When the code enters the sequential stage, however, one CPU shows 100% user load, while each of the remaining three shows 20% user + 80% system = 100% load as well.

It was the same for all kernels I tried, from 2.6.3 to 2.6.8.
The same code run on another, dual-Opteron machine (different motherboard) shows the same behaviour.

However, if I compile exactly the same code with Intel ifort, the resulting 32-bit binary performs as expected: 100% user load on all 4 CPUs during the parallel stage, and 100% user load on one CPU with 0% load (both user and system) on each of the remaining three during the sequential part.

Any ideas ?

BTW, my quad-Opteron hard-crashes after an hour of such work, while the dual is stable. The ifort-produced code does not crash the quad Opteron, so I believe the extra load reported with pgf90 is not imaginary.

I’d be very interested to know if the same behavior occurs on SuSE 9.1, since Gentoo is not one of our supported OSes. Most likely something is wrong; it’s just a matter of whether it’s something fundamental or specific to Gentoo.

Other things you might try are:

It’s possible that you’re running out of stack space. Try setting your stack size to unlimited (‘unlimit’ command in tcsh) and rerunning.
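For reference, the exact command depends on your shell; a minimal sketch (the tcsh form is shown as a comment since it only works inside tcsh):

```shell
# tcsh/csh: remove the stack-size limit for the current shell:
#   unlimit stacksize
# bash/sh equivalent:
ulimit -s unlimited 2>/dev/null || true   # may be refused in restricted environments
ulimit -s                                 # print the effective stack limit
```

The limit only applies to processes started from that shell, so run the OpenMP binary from the same session.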

Since it’s crashing after a period of time, there could be a memory leak. Watch your memory usage and note what the memory was at the time of the crash. (xosview is a nice utility for this.)

What other processes (including kernel processes) are running at the time of the crash? What’s the system load at the time of the crash?

What happens when you compile with “-O0 -g -mp”? In other words, could optimization be causing the problem?

Since pgdbg (our debugger) allows you to debug multiple threads, try compiling with “-O0 -g -mp” and running the application through the debugger. Pay particular attention to what happens to the threads once they leave the parallel region.

If none of this helps, could we get a copy of the code here?


On my quad-opteron I have gentoo linux with 2.6.8-gentoo-r4 kernel.

cat /proc/version
Linux version 2.6.8-gentoo-r4 (root@quad-opteron) (gcc version 3.3.4 20040623 (Gentoo Linux 3.3.4-r1, ssp-3.3.2-2, pie-8.7.6)) #1 SMP Wed Sep 29 19:09:09 MDT 2004

I have also tried previous kernel versions 2.6.3, 2.6.7.
This is a NUMA kernel with preempt; bank interleave is off in the BIOS.

But on my dual opteron I have Fedora Core 2 with FC2 updated 2.6.8 kernel

cat /proc/version
Linux version 2.6.8-1.521custom (root@weinberg) (gcc version 3.3.3 20040412 (Red Hat Linux 3.3.3-7)) #12 SMP Fri Sep 3 10:22:49 MDT 2004
This kernel is also compiled with NUMA and preempt; however, bank interleave is on and NUMA is not used (if I understand correctly). The custom recompile was related to the 3ware controller.

The behaviour of the code is the same on both platforms. I have tried kernels compiled without the preempt and/or NUMA options - the results are unchanged.

I would also be interested to try the code in a SuSE environment, but I don’t have access to any SuSE machine. In principle the compiled code is not large and has no interface to speak of, so I can easily provide the compiled binary if you can make use of it (or the f90 code, but two custom libraries have to be linked in for compilation).

Regarding other suggestions

For now I don’t treat the crash as directly related, at the software level, to the pgf-compiled code. The quad Opteron crashes with a machine check exception, which is hardware-level unhappiness. It could be that it simply overheats with 4 CPUs at 100% load for 1.5 hours.
I need to investigate this, but the dual machine does not crash at all with exactly the same binary.

I have been watching different system parameters, and there is nothing obvious going on. Memory usage is stable (the code uses some 2.2 GB of RAM in the parallel segment and 1.6 GB during the sequential part; my quad machine has 32 GB of RAM, the dual 8 GB).

There is nothing else special running at the time of execution. On the quad-Opteron (server) machine there is not even an X server, just kernel daemons and some services.
I’m always a single user when I’m running this code.

I’ll check the average load and report it, and will also try without optimization.

Could you instruct me how to send you the code ?

After receiving dmpogo’s code and talking with our engineers, I found out that this is actually an optimization. Instead of creating the threads every time a parallel region is encountered, we simply reuse the already-created threads and save a substantial amount of overhead. When not in use, the extra threads sit in a ‘sched_yield’ state, meaning that they only use idle cycles and will yield the processor to other jobs.

To work around this optimization, you can explicitly set the number of threads before and after the parallel region. So before the parallel region you would add:

call omp_set_num_threads(ncpus)

and after the parallel region you’d add:

call omp_set_num_threads(1)

Note, however, that some Linux distributions can crash if omp_set_num_threads is called too many times, where “too many” is somewhere between 10 and 1000. I’d only use this method if you have only a few parallel regions.
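Put together, the workaround looks roughly like this in Fortran (a sketch only; `ncpus`, `n`, and the loop body are placeholders for your own code):

```fortran
      use omp_lib

      call omp_set_num_threads(ncpus)    ! wake the worker threads
!$omp parallel do
      do i = 1, n
         ! ... parallel work ...
      end do
!$omp end parallel do
      call omp_set_num_threads(1)        ! drop to one thread so the
                                         ! workers stop spinning in
                                         ! sched_yield

      ! ... sequential part now runs with the other CPUs truly idle ...
```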

Currently, dmpogo is investigating his system crash, but believes it’s due to factors other than the compiler.

Thanks for the question! I learned a lot while investigating it.

  • Mat