OpenMP Application Performance

Hi Folks,

I have been working with the PGI 6.0 compilers for about a month now whilst porting some in-house simulation codes. We are finding that the performance scaling with PGI OpenMP on dual- and quad-Opteron systems is pretty poor.

We have three simulation codes, all developed on large SMP boxes (mostly SGI), where they scale pretty much linearly up to somewhere between 8 and 32 processors. However, when we build and run on the Opterons, we only get a speedup of 1.2-1.3 on dual-processor systems and 2.1-2.3 on quad boxes.

Am I missing some optimisation options here? There may be an issue with memory locality, which could be either a kernel problem or internal to the compiler. It appears that the codes are bottlenecking on memory transfers.

I’m tempted to think this is a compiler/OpenMP implementation problem, since we are finding pretty much the same thing with f77, f90, C and C++, whereas code like HPL Linpack scales pretty well when MPI handles the parallelism.

Anyone had similar issues with their own code?

Hi Cannonfodda,

It could be the compiler, but it most likely has to do with where the OS is placing your data. If you don’t already, try initializing your data in parallel. Memory is physically allocated on the node where it is first touched, so initializing data in parallel helps distribute it to the nodes where it will most likely be processed.
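For example, a minimal sketch along these lines (the array name and problem size are just placeholders) does the first touch inside an OpenMP loop with the same static schedule the compute loops will use, so each thread’s pages end up on its own node:

    #include <cstdlib>

    int main() {
        const long n = 100000000;    // placeholder problem size
        double *data = static_cast<double *>(std::malloc(n * sizeof(double)));

        // First touch in parallel: malloc has not touched the pages yet, so the
        // thread that writes each chunk here decides which node the pages land on.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            data[i] = 0.0;

        // Compute loops should use the same static schedule so each thread mostly
        // reads and writes memory on its own node.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            data[i] = data[i] * 2.0 + 1.0;

        std::free(data);
        return 0;
    }

Build with -mp as usual so the OpenMP directives are active.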

If available on your system, numactl will give you better control over how the processes are scheduled and where the memory is placed. We’ve found that for some memory-bound programs, using the interleave memory policy (“numactl --interleave=all”) can help quite a bit. Please refer to the numactl man page for more information.
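For instance, to run your binary (call it ./mysim, just as an example) with its pages interleaved across all nodes:

    numactl --interleave=all ./mysim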

With the 6.0-1 release, we changed our default OpenMP loop iteration scheduling to align iterations with array references. This change seemed to help many OpenMP codes. Unfortunately, after we released 6.0-1, we found that it severely hurt performance with several other codes. With 6.0-2 we reverted the default back to static scheduling and placed the alternate “align” scheduling under the flag “-mp=align”. If you have release 6.0-1, try downloading and installing the most recent patch release (currently 6.0-4). If you have 6.0-2 or higher, try using the flag “-mp=align”.

With the upcoming 6.0-5 release, we have addressed an inefficiency in our implementation of THREADPRIVATE. If you use THREADPRIVATE extensively, then please try this new version of the compilers once it is available.
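(For reference, by THREADPRIVATE I mean the directive that gives each thread its own persistent copy of a file-scope or static variable, along the lines of this sketch; the variable name is made up:)

    #include <cstdio>
    #include <omp.h>

    static int local_count = 0;              // placeholder per-thread accumulator
    #pragma omp threadprivate(local_count)

    int main() {
        #pragma omp parallel
        {
            local_count = omp_get_thread_num();  // each thread writes its own copy
            #pragma omp critical
            printf("thread %d: local_count = %d\n", omp_get_thread_num(), local_count);
        }
        return 0;
    }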

Hope this helps,
Mat

Hi Mat,

Thanks for the help.

I’ve tried messing around with various numactl parameters and none of it seems to make a blind bit of difference. We are already using 6.0-4, and I do use THREADPRIVATE quite extensively, which may explain why some of our simpler Fortran codes scale much better.

I’ve been digging around a bit more and got into some of the NUMA module source/docs, and I think you are correct in saying that it’s a problem with where the kernel is physically allocating the memory. Unfortunately, there doesn’t seem to be an easy solution :(. The code was developed on large CC-NUMA SGI machines where a combination of the hardware and the kernel takes care of migrating memory along with the processes. I suppose I have just got lazy ;).

I’ll try what you suggest and allocate the memory in a parallel section of code. The combination of C++ constructors, memory allocation and OpenMP should make life interesting.
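Something like the following is what I have in mind (a rough sketch, with a made-up Cell type and count): grab raw, untouched storage first, then run the constructors from inside a parallel loop so that construction is the first touch.

    #include <new>
    #include <cstdlib>

    struct Cell {                      // made-up stand-in for a simulation cell
        double value;
        Cell() : value(0.0) {}
    };

    int main() {
        const long n = 10000000;       // placeholder count
        // Raw allocation does not touch the pages yet.
        Cell *cells = static_cast<Cell *>(std::malloc(n * sizeof(Cell)));

        // Placement new in parallel: the constructor call is the first touch,
        // so each chunk's pages land on the node of the thread that runs it.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            new (&cells[i]) Cell();

        // ... compute loops with the same static schedule ...

        for (long i = 0; i < n; ++i)   // destroy and release
            cells[i].~Cell();
        std::free(cells);
        return 0;
    }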

Cheers,

Campbell