Execution problem when using mpiexec and PGI 6.0.x

Hi,

I am using the pghpf compiler to compile my programs.
I am working on a cluster with a PBS queuing system.
I didn’t have any problems with version 5.x of the compiler.
We recently upgraded to version 6, and that is where my
problem began.
My executable is submitted to the queue with the following script:

#!/bin/tcsh
#PBS -q q2w1n
#PBS -j oe -k oe
#PBS -l nodes=1:ppn=2
#PBS -v BeginCFG,EndCFG

#cd /home/fgcao/runVacuumResp
cd /home/fgcao/fbissey/SandBox

# Set run parameters
set ncpus=2

setenv LD_LIBRARY_PATH /opt/pgi601/linux86/6.0/lib

###########################################################
#  Script Name must be 15 characters or less
#  To run:
#          qsub -v BeginCFG=001,EndCFG=001 lyplan2.csh
#
###########################################################

# Deal with file names
#
set exeFlags   = "-n $ncpus"
set beta = "b460"
set size = "s16t32"
set imp  = "IMP"
set basedir = "/home/fgcao/Configurations/"$size"/"
set dir  = "su3"$beta$size$imp
set baseConfig = $dir"c"
set yorn = ".true."
set smear3d = 30
set prefix = "./results/"
set exeName = "./VacuumRespLYplan"$size

  set thisReport = "RunStatusLYplan"$size"-"$ncpus"c"$BeginCFG"-"$EndCFG
  echo `date`
  pwd

# Run the parallel program
  echo  "mpiexec -verbose $exeFlags $exeName -pghpf -np $ncpus > $thisReport"
  mpiexec -verbose $exeFlags $exeName -pghpf -np $ncpus > $thisReport << ....END
$basedir
$baseConfig
$BeginCFG
$EndCFG
$prefix
3  three-loop improved fMuNu
$smear3d
1  1: action and topological charge, 2: electric and magnetic fields
$yorn
....END

And it is submitted using qsub. Programs compiled with pghpf
version 5 work fine; with version 6 they don’t.
In one case I get the following message:

mpiexec -verbose -n 2 ./VacuumRespLYplans16t32 -pghpf -np 2 > RunStatusLYplans16t32-2c192-192
0 - MPI_SEND : Invalid rank 1
[0]  Aborting program !
[0] Aborting program!
0 - MPI_SEND : Invalid rank 1
[0]  Aborting program !
[0] Aborting program!
mpiexec: Warning: tasks 0-1 exited with status 1.

and if I remove “-pghpf -np 2” from the script, the message becomes:

PGFIO-F-217/formatted read/unit=5/attempt to read past end of file.
 File name = stdin     formatted, sequential access   record = 1
 In source file VacuumRespLY_plan.f, at line number 119
[0] MPI Abort by user Aborting program !
[0] Aborting program!

In this case the program cannot read its input. I also see this
last behavior on an AMD64 cluster, without removing the
“-pghpf -np 2” arguments.
Running the program interactively or changing the script to
execute outside of the queue (and on one processor) works.
Only when I try to run it with mpiexec in the queuing system
do I have problems.
What has changed to cause this behavior? And what can I do
apart from hardwiring my input?

This might be pretty hard to track down… First things:

  1. Make sure that no code compiled with 5.2 is being mixed with 6.0 (any libs, etc.).

  2. $PGI/linux86/6.0/src/mpi/mpi.c
    should be compiled with your version of the MPI headers, and the resulting mpi.o should be linked ahead of the PGI libs; a sketch of the command lines follows this list.
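
For concreteness, something along these lines is what I mean. The MPICH install prefix and the program/object names below are only placeholders, so substitute whatever your cluster and build actually use:

# compile the PGI-supplied MPI glue against your own MPI headers
# (/usr/local/mpich is a placeholder for your MPICH install prefix)
gcc -c -I/usr/local/mpich/include $PGI/linux86/6.0/src/mpi/mpi.c

# link the resulting mpi.o ahead of the PGI libraries when building the program
# (myprog.o stands in for your own object files)
pghpf -o myprog myprog.o mpi.o -L/usr/local/mpich/lib -lfmpich -lmpich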

My own programs and libs are clean in that respect thanks
to a “make clean”. Now, I am not doing the admin on the
cluster and I didn’t install MPICH myself. Does it need to
be recompiled against the new compiler or something?
Which brings us to your point #2, I guess.

I read the README file in that directory. You suggest that I
replace the standard MPI library (in this case I link against
libfmpich.a from the MPICH distribution) with the object
generated from this file.
I will give it a go, I guess.

I tried compiling with mpi.o. The linker still requires
libfmpich.a, which I think is where the problem may lie.
In any case, using the mpi.o produced by “gcc -ansi -c mpi.c”
gives a linking error:

mpi.o(.text+0x21): In function `__hpf_ISEND':
: undefined reference to `lam_mpi_byte'
mpi.o(.text+0x55): In function `__hpf_IRECV':
: undefined reference to `lam_mpi_byte'
mpi.o(.text+0x9e): In function `__hpf_SEND':
: undefined reference to `lam_mpi_byte'
mpi.o(.text+0xcb): In function `__hpf_RECV':
: undefined reference to `lam_mpi_byte'
mpi.o(.text+0xe5): In function `__hpf_Abort':
: undefined reference to `lam_mpi_comm_world'
mpi.o(.text+0x115): In function `__hpf_Init':
: undefined reference to `lam_mpi_comm_world'
mpi.o(.text+0x11f): In function `__hpf_Init':
: undefined reference to `lam_mpi_comm_world'

So it doesn’t work anyway.
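
If I read those errors correctly, the mpi.h that gcc picked up is LAM’s (those lam_mpi_* globals belong to LAM, not MPICH), so I guess the link would also have to pull in LAM’s own libraries rather than libfmpich.a. Purely as a guess, with the library names taken from a standard LAM install and <my objects> standing for my actual object files, it would look something like:

# compile the glue with LAM's own wrapper so it sees LAM's mpi.h
mpicc -c $PGI/linux86/6.0/src/mpi/mpi.c

# link against LAM's libraries (add -L<lam lib dir> if they are not on a default path)
pghpf -o VacuumRespLYplans16t32 <my objects> mpi.o -lmpi -llam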

Hi fbissey,

As Brent indicated, this is a tough one to track down since there are a lot of pieces in play. The most likely cause is that your MPICH Fortran interface needs to be rebuilt with the 6.0 version of the compilers. However, given the undefined references in your last post and the fact that you're using mpiexec, it appears you're actually using LAM/MPI, not MPICH. In either case, try compiling and linking with the MPICH libraries that were included with the 6.0 CDK release (in the "lib" directory). Then run your application using the "mpirun" script found in the PGI bin directory.
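
As a rough illustration only (the install prefix, library location, and input file name below are examples, so adjust them for your site):

# example CDK install prefix; put the PGI bin directory on your path
setenv PGI /opt/pgi601
set path = ( $PGI/linux86/6.0/bin $path )

# build against the MPICH libraries shipped with the 6.0 CDK
pghpf -o VacuumRespLYplans16t32 VacuumRespLY_plan.f \
      -L$PGI/linux86/6.0/lib -lfmpich -lmpich

# launch with the PGI-supplied mpirun rather than mpiexec
mpirun -np 2 ./VacuumRespLYplans16t32 < my_input.txt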

If this works, then you should just need to recompile your MPI Fortran interface. If it still fails, please send a report along with the code to trs@pgroup.com since it could be a compiler issue.

Thanks,
Mat