OpenMP and magically evil performance results

I have a fairly large code which uses OpenMP heavily.

My machine is a dual-socket Xeon E5-2620 system: 2 packages, 6 physical cores per package, and hyper-threading with 2 threads per core, for a total of 24 virtual processors.

I intermittently experience a condition where the multi-processing…fails? Properties of this condition include:

  • The program returns the correct result…eventually.
  • When using 1 thread, the program maxes out one processor.
  • When using 2-6 threads, the program gets a speedup of roughly 1.5-3×, with CPU utilization somewhat below the threads/total-virtual-cores ratio.
  • At some number of threads (sometimes 13, sometimes a different number), CPU utilization drops below 4%, i.e. less than a single maxed-out core.
  • The condition persists through reboots, deleting and rebuilding all compiled code, and other attempts to fix it.
  • The condition magically goes away for weeks at a time.

Sometimes I get a reasonable speedup-versus-number-of-cores curve (with a drop at 7 cores, and otherwise a convex curve landing at about 10×), and sometimes my speedup curve suddenly becomes a slowdown curve.

Both good and bad results can come from the same code. I have tried to disable all “turn my processor down” power management settings.

Does anyone have any idea what’s going on here?

A follow-up on some noted PGI/Intel differences in OpenMP.

When things are working well with my code and I set the number of threads to 12, the program executes on only one physical processor. That is to say, all 12 threads are placed on the 12 virtual cores corresponding to the 6 physical cores of one of the two packages.

This is exactly the wrong placement for my particular case, where it would be better to spread the 12 threads across as many physical cores as possible. Intel's OpenMP runtime has flags for controlling this (its thread affinity interface).

Is there any equivalent interface that I have missed for PGI OpenMP?
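
For reference, a quick placement probe along the following lines (a throwaway sketch of my own, not part of the real application; sched_getcpu() is a glibc extension) is enough to see the thread-to-core mapping described above:

    /* Quick placement probe (sketch): print which virtual core each
     * OpenMP thread is currently running on.  sched_getcpu() is a
     * glibc extension; build with e.g. "pgcc -mp" or "gcc -fopenmp". */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            printf("thread %2d of %2d is on CPU %d\n",
                   omp_get_thread_num(), omp_get_num_threads(),
                   sched_getcpu());
        }
        return 0;
    }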

Hi Andrew,

Is there any equivalent interface that I have missed for PGI OpenMP?

Yes — set “MP_BIND=yes” to enable thread-to-core binding. This will bind thread 0 to core 0, thread 1 to core 1, and so on. Optionally, you can set “MP_BLIST=0,2,4,…” to specify the exact binding order.

There’s also the OpenMP standard “OMP_PROC_BIND=true” environment variable.
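
If you want to confirm from inside the program which policy the runtime actually picked up, the OpenMP 4.0 API has omp_get_proc_bind(). A minimal sketch, assuming your runtime supports OpenMP 4.0:

    /* Sketch: report the proc-bind policy the runtime is using.
     * Requires an OpenMP 4.0 runtime. */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_proc_bind_t b = omp_get_proc_bind();
        const char *name = b == omp_proc_bind_false  ? "false"  :
                           b == omp_proc_bind_true   ? "true"   :
                           b == omp_proc_bind_master ? "master" :
                           b == omp_proc_bind_close  ? "close"  :
                           b == omp_proc_bind_spread ? "spread" : "unknown";
        printf("proc_bind policy: %s\n", name);
        return 0;
    }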

If you’re on Linux, I find the “numactl” utility useful as well. It sets the binding but is agnostic to the compiler, so the same settings can be used regardless of how the binary was built. (See “man numactl” for details.)
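
Whichever mechanism you use, a quick way to see what the binding actually did is to dump each thread's allowed CPU set. A Linux-only sketch of my own (sched_getaffinity is Linux-specific):

    /* Sketch: print the set of virtual cores each OpenMP thread is
     * allowed to run on, to confirm what MP_BIND/MP_BLIST or numactl
     * actually set. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            cpu_set_t mask;
            sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = calling thread */

            #pragma omp critical
            {
                printf("thread %2d may run on:", omp_get_thread_num());
                for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
                    if (CPU_ISSET(cpu, &mask))
                        printf(" %d", cpu);
                printf("\n");
            }
        }
        return 0;
    }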

Note that you might be better off halving the number of threads and running only on the physical cores. Hyper-threading is useful for fast context switching, but only one thread can execute on a core at a time, so if your threads are all doing heavy computation they will contend for the cores. It may not be better, but it is worth an experiment.
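
A rough shape for that experiment, as a sketch only (the reduction loop below is just a stand-in for your real kernel):

    /* Sketch of the experiment: time the same compute-bound loop with
     * 6 threads (physical cores only) and 12 threads (hyper-threads). */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const long n = 200000000L;
        const int counts[] = { 6, 12 };

        for (int c = 0; c < 2; ++c) {
            double sum = 0.0;
            double t0 = omp_get_wtime();

            #pragma omp parallel for num_threads(counts[c]) reduction(+:sum)
            for (long i = 0; i < n; ++i)
                sum += 1.0 / (double)(i + 1);   /* arbitrary FP work */

            double t1 = omp_get_wtime();
            printf("%2d threads: %.3f s  (checksum %f)\n",
                   counts[c], t1 - t0, sum);
        }
        return 0;
    }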

Another environment variable to try is “MP_SPIN”. This sets the number of times the OpenMP runtime checks a semaphore before putting a blocked thread to “sleep” (via sched_yield). Setting “MP_SPIN=-1” tells threads never to sleep, which saves the cost of suspending and resuming them. However, a spinning thread keeps polling and thus consumes compute resources, which may be a problem if you oversubscribe your physical cores or have other applications running.

Hope this helps,
Mat

Thanks, Mat. Helps some.

I get distinctly conflicting impressions about hyper-threading depending on what and where I read. But at any rate, MP_BLIST and MP_BIND are enough to tweak the affinity and pin threads to physical cores (which, from my experiments, is usually the best way to do it).

Any thoughts on the magical OpenMP-not-working situation? I don’t have it right now, but I don’t know what made it go away and what will make it come back.

Any thoughts on the magical OpenMP-not-working situation?

My best guess is that the OS was scheduling all the threads on the same core. Let’s see if it still persists once you start binding.

Mat