MPICH - parallel programs on dual Xeon / Opteron clusters

Dear forum,

I have a strange problem with parallel programs on clusters that have two CPUs (Xeon
or Opteron) per node. My program uses a computational domain of 600 x 600
grid points with one double-precision variable per lattice site. The program measures
the pure computation time and the pure communication time separately. When the
program is run on 2 CPUs, each CPU gets a 300 x 600 portion to handle, so the
pure computation time should be cut in half on each CPU. Here is the problem:
When the job is run on two CPUs on different cluster nodes, this is exactly what
happens. But when the job is run on both CPUs of one cluster node, the time
for computation remains unchanged, although the domain is cut in half on each CPU!
This happens on Xeon as well as Opteron with PGI 5.2 or higher (older versions were
not tested). A slight effect in this direction is also observable with Gnu C, but it is by
far not as severe as with pgcc. The operating system is SuSE Linux 8.1 and higher.
For PGI the mpich version that came with the respective PGI CDK was used; for
Gnu C mpich- was used. The elapsed time for each job was cross-checked with
timers independent of the program; it always corresponds to computation time +
communication time. No mpich calls are issued in the computational part.

Any ideas?

Many thanks,
Michael

Hi Michael,

It sounds like the processes are memory bound. How much computation is performed on each grid point? If you're doing relatively few calculations per point, then each process will need to fetch data from memory more often, causing memory bus contention. If this is the case, you can try experimenting with prefetching, “-Mprefetch”, and/or non-temporal stores, “-Mnontemporal”, to see if you can alleviate some of the memory pressure.

Hope this helps,

Dear Mat,

The program is a simple-minded iterative solver for the Poisson equation in two
dimensions, written for performance-test purposes. From the nature of the algorithm,
the CPUs have to exchange data across the boundaries after each iteration, so
very little computation is done per lattice point between mpich calls.

Neither -Mprefetch nor -Mnontemporal could alleviate the memory pressure.
This program is rather old, and the only explanation I have for not noticing the
problem earlier is that my previous tests used the Gnu compiler, which shows
only a slight indication of it. I also thought that in a NUMA architecture each
processor has its own share of memory attached to it, and that memory access
by one CPU to its share does not disturb memory access by the other CPU to its
share. Why does this happen with PGI but, essentially, not with Gnu C, even on
dual Opteron nodes (where the HyperTransport architecture accelerates
‘crosswise’ access of the CPUs to memory)? The OpenMP version of my program
works nicely on dual Opterons and shows the typical bus contention on dual
Xeons when the cache size is exceeded.

Still puzzled,

Hi Michael,

I do think it’s a memory issue, but I don’t really know why it doesn’t occur with GCC. However, you might be on to something: NUMA support is not “on” unless you link in the NUMA libraries or use the “numactl” utility. Try linking with “-mp”, which will pull in the NUMA libraries. I don’t know whether this will help, but since your OpenMP version works as expected, it’s worth a try.

Mat