I have a strange problem with parallel programs on clusters with two CPUs (Xeon
or Opteron) per node. My program contains a computational domain of 600 x 600
grid points with one double precision variable per lattice site. The program measures
the pure computation time and the pure communication time separately. When the
program is run on 2 CPUs, each CPU gets a 300 x 600 portion to handle and therefore
the pure computation time should be cut in half on each CPU. Here is the problem:
When the job is run on two CPUs on different cluster nodes, this is exactly what
happens. But when the job is run on both CPUs of one cluster node, the time
for computation remains unchanged although the domain is cut in half on each CPU!
This happens on both Xeon and Opteron with PGI 5.2 or higher (older versions were not
tested). A slight effect in this direction is also observable with GNU C, but it is far
less severe than with pgcc. The operating system is SuSE Linux 8.1 or higher. For PGI
the mpich version that came with the respective PGI-CDK was used, for GNU C
mpich-22.214.171.124 was used. The elapsed time for each job was cross-checked with
timers independent of the program; it always corresponds to computation time +
communication time. No mpich calls are issued in the computational part.
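
For reference, here is a minimal sketch of the kind of measurement described above, with MPI left out entirely. It is not the actual program: the 5-point averaging sweep is a hypothetical stand-in for the real kernel, and the names wall_seconds, grid_bytes, sweep and time_sweeps are made up for illustration. It assumes a POSIX system with clock_gettime. Each array of 600 x 600 doubles takes 2,880,000 bytes, and each 300 x 600 half takes 1,440,000 bytes.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock time in seconds, independent of any MPI timer. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Memory needed for one nx x ny array of doubles. */
static size_t grid_bytes(size_t nx, size_t ny)
{
    return nx * ny * sizeof(double);
}

/* One sweep of a hypothetical 5-point average over the interior points.
 * Returns a checksum so the compiler cannot discard the work. */
static double sweep(const double *in, double *out, size_t nx, size_t ny)
{
    double sum = 0.0;
    for (size_t i = 1; i + 1 < nx; ++i) {
        for (size_t j = 1; j + 1 < ny; ++j) {
            size_t k = i * ny + j;
            out[k] = 0.25 * (in[k - ny] + in[k + ny] + in[k - 1] + in[k + 1]);
            sum += out[k];
        }
    }
    return sum;
}

/* Time nsweeps sweeps on an nx x ny grid; returns elapsed seconds,
 * or -1.0 if the arrays could not be allocated. */
static double time_sweeps(size_t nx, size_t ny, int nsweeps)
{
    double *a = malloc(grid_bytes(nx, ny));
    double *b = malloc(grid_bytes(nx, ny));
    volatile double sink = 0.0;   /* keeps the checksum alive */
    double t0, t1;

    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1.0;
    }
    for (size_t k = 0; k < nx * ny; ++k) {
        a[k] = 1.0;
        b[k] = 1.0;
    }

    t0 = wall_seconds();
    for (int s = 0; s < nsweeps; ++s) {
        double *tmp;
        sink += sweep(a, b, nx, ny);
        tmp = a; a = b; b = tmp;  /* output becomes next input */
    }
    t1 = wall_seconds();

    free(a);
    free(b);
    return t1 - t0;
}
```

Calling time_sweeps(600, 600, n) once and time_sweeps(300, 600, n) once gives the single-CPU baseline. Starting two copies of the 300 x 600 case at the same time on one node, and then on two separate nodes, would show whether the slowdown appears even when no mpich library is linked in.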