MPICH - parallel programs on dual Xeon / Opteron clusters

Dear forum,

I have a strange problem with parallel programs on clusters that have two CPUs (Xeon
or Opteron) per node. My program uses a computational domain of 600 x 600
grid points with one double-precision variable per lattice site. The program measures
the pure computation time and the pure communication time separately. When the
program is run on 2 CPUs, each CPU gets a 300 x 600 portion to handle, so the
pure computation time should be cut in half on each CPU. Here is the problem:
When the job is run on two CPUs on different cluster nodes, this is exactly what
happens. But when the job is run on both CPUs of one cluster node, the time
for computation remains unchanged, although the domain is cut in half on each CPU!
This happens on Xeon as well as Opteron with PGI 5.2 or higher (older versions were
not tested). A slight effect in this direction is also observable with Gnu C, but it is by
far not as severe as with pgcc. The operating system is SuSE Linux 8.1 and higher.
For PGI the mpich version that came with the respective PGI CDK was used; for
Gnu C mpich- was used. The elapsed time for each job was cross-checked with
timers independent of the program; it always corresponds to computation time +
communication time. No mpich calls are issued in the computational part.

Any ideas?

Many thanks,
Michael

Hi Michael,

It sounds like the processes are memory bound. How much computation is performed on each grid point? If you're doing relatively few calculations per point, then each process will need to fetch data from memory more often, causing memory bus contention. If this is the case, you can try experimenting with prefetching, “-Mprefetch”, and/or non-temporal stores, “-Mnontemporal”, to see if you can alleviate some of the memory pressure.

Hope this helps,

Dear Mat,

The program is a simple-minded iterative solver for the Poisson equation in two
dimensions, written for performance-test purposes. From the nature of the algorithm,
the CPUs have to exchange data across the boundaries after each iteration, so
very little computation is done per lattice point between mpich calls.

Neither -Mprefetch nor -Mnontemporal could alleviate the memory pressure.
This program is rather old, and the only explanation I have for not noticing the
problem earlier is that my previous tests used the Gnu compiler, which shows
only a slight indication of it. I also thought that in a NUMA architecture each
processor has its own share of memory attached to it, and that memory access
by one CPU to its share does not disturb memory access by the other CPU to its
share. Why does this happen with PGI but, essentially, not with Gnu C, even on
dual Opteron nodes (where the HyperTransport architecture accelerates
‘crosswise’ access of the CPUs to memory)? The OpenMP version of my program
works nicely on dual Opterons and shows the typical bus contention on dual
Xeons when the cache size is exceeded.

Still puzzled,

Hi Michael,

I do think it’s a memory issue, but I don’t really know why it doesn’t occur with GCC. However, you might be on to something: NUMA support is not “on” unless you link in the NUMA libraries or use the “numactl” utility. Try linking with “-mp”, which will pull in the NUMA libraries. I don’t know whether this will help, but since your OpenMP version works as expected, it’s worth a try.

Mat