Execution problem when using mpiexec and PGI 6.0.x

Hi,

I am using the pghpf compiler to compile my programs.
I am working on a cluster with a PBS queuing system.
I didn’t have any problems with version 5.x of the compiler.
We recently upgraded to version 6, and that is where my
problem began.
My executable is submitted to the queue with the following script:

#!/bin/tcsh
#PBS -q q2w1n
#PBS -j oe -k oe
#PBS -l nodes=1:ppn=2
#PBS -v BeginCFG,EndCFG

#cd /home/fgcao/runVacuumResp
cd /home/fgcao/fbissey/SandBox

# Set run parameters
set ncpus=2

setenv LD_LIBRARY_PATH /opt/pgi601/linux86/6.0/lib

###########################################################
#  Script Name must be 15 characters or less
#  To run:
#          qsub -v BeginCFG=001,EndCFG=001 lyplan2.csh
#
###########################################################

# Deal with file names
#
set exeFlags   = "-n $ncpus"
set beta = "b460"
set size = "s16t32"
set imp  = "IMP"
set basedir = "/home/fgcao/Configurations/"$size"/"
set dir  = "su3"$beta$size$imp
set baseConfig = $dir"c"
set yorn = ".true."
set smear3d = 30
set prefix = "./results/"
set exeName = "./VacuumRespLYplan"$size

  set thisReport = "RunStatusLYplan"$size"-"$ncpus"c"$BeginCFG"-"$EndCFG
  echo `date`
  pwd

# Run the parallel program
  echo  "mpiexec -verbose $exeFlags $exeName -pghpf -np $ncpus > $thisReport"
  mpiexec -verbose $exeFlags $exeName -pghpf -np $ncpus > $thisReport << ....END
$basedir
$baseConfig
$BeginCFG
$EndCFG
$prefix
3  three-loop improved fMuNu
$smear3d
1  1: action and topological charge, 2: electric and magnetic fields
$yorn
....END

And it is submitted using qsub. Programs compiled with pghpf
version 5 work fine; with version 6 they don’t.
In one case I get the following message:

mpiexec -verbose -n 2 ./VacuumRespLYplans16t32 -pghpf -np 2 > RunStatusLYplans16t32-2c192-192
0 - MPI_SEND : Invalid rank 1
[0]  Aborting program !
[0] Aborting program!
0 - MPI_SEND : Invalid rank 1
[0]  Aborting program !
[0] Aborting program!
mpiexec: Warning: tasks 0-1 exited with status 1.

and if I remove “-pghpf -np 2” from the script, the message becomes:

PGFIO-F-217/formatted read/unit=5/attempt to read past end of file.
 File name = stdin     formatted, sequential access   record = 1
 In source file VacuumRespLY_plan.f, at line number 119
[0] MPI Abort by user Aborting program !
[0] Aborting program!

In this case the program cannot read its input. I also see this
last behavior on an AMD64 cluster, without removing the
“-pghpf -np 2” arguments.
Running the program interactively or changing the script to
execute outside of the queue (and on one processor) works.
Only when I try to run it with mpiexec in the queuing system
do I have problems.
What has changed to cause this behavior? And what can I do
apart from hardwiring my input?

This might be pretty hard to track down… First things:

  1. Make sure that no code compiled with 5.2 is being mixed with 6.0 (any libs, etc.).

  2. $PGI/linux86/6.0/src/mpi/mpi.c
    should be compiled with your version of the MPI headers, and the resulting mpi.o should be linked ahead of the PGI libs; a sketch of the command lines follows this list.
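
For concreteness, something along these lines is what I mean. The MPICH install prefix and the program/object names below are only placeholders, so substitute whatever your cluster and build actually use:

# compile the PGI-supplied MPI glue against your own MPI headers
# (/usr/local/mpich is a placeholder for your MPICH install prefix)
gcc -c -I/usr/local/mpich/include $PGI/linux86/6.0/src/mpi/mpi.c

# link the resulting mpi.o ahead of the PGI libraries when building the program
# (myprog.o stands in for your own object files)
pghpf -o myprog myprog.o mpi.o -L/usr/local/mpich/lib -lfmpich -lmpich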

My own programs and libs are clean in that respect thanks
to a “make clean”. Now, I am not doing the admin on the
cluster and I didn’t install MPICH myself. Does it need to
be recompiled against the new compiler or something?
Which brings us to your point #2, I guess.

I read the README file in that directory. You suggest that I
replace the standard MPI library (in this case I link against
libfmpich.a from the MPICH distribution) with the object
generated from this file.
I will give it a go, I guess.

I tried compiling with mpi.o. The linker still requires
libfmpich.a, which I think is where the problem may lie.
In any case, using the mpi.o produced by “gcc -ansi -c mpi.c”
gives a linking error:

mpi.o(.text+0x21): In function `__hpf_ISEND':
: undefined reference to `lam_mpi_byte'
mpi.o(.text+0x55): In function `__hpf_IRECV':
: undefined reference to `lam_mpi_byte'
mpi.o(.text+0x9e): In function `__hpf_SEND':
: undefined reference to `lam_mpi_byte'
mpi.o(.text+0xcb): In function `__hpf_RECV':
: undefined reference to `lam_mpi_byte'
mpi.o(.text+0xe5): In function `__hpf_Abort':
: undefined reference to `lam_mpi_comm_world'
mpi.o(.text+0x115): In function `__hpf_Init':
: undefined reference to `lam_mpi_comm_world'
mpi.o(.text+0x11f): In function `__hpf_Init':
: undefined reference to `lam_mpi_comm_world'

So it doesn’t work anyway.
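
If I read those errors correctly, the mpi.h that gcc picked up is LAM’s (those lam_mpi_* globals belong to LAM, not MPICH), so I guess the link would also have to pull in LAM’s own libraries rather than libfmpich.a. Purely as a guess, with the library names taken from a standard LAM install and <my objects> standing for my actual object files, it would look something like:

# compile the glue with LAM's own wrapper so it sees LAM's mpi.h
mpicc -c $PGI/linux86/6.0/src/mpi/mpi.c

# link against LAM's libraries (add -L<lam lib dir> if they are not on a default path)
pghpf -o VacuumRespLYplans16t32 <my objects> mpi.o -lmpi -llam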

Hi fbissey,

As Brent indicated, this is a tough one to track down since there are a lot of pieces in play. The most likely cause is that your MPICH Fortran interface needs to be rebuilt with the 6.0 version of the compilers. However, given the undefined references in your last post and the fact that you're using mpiexec, it appears you're actually using LAM/MPI, not MPICH. In either case, try compiling and linking with the MPICH libraries that were included with the 6.0 CDK release (in the "lib" directory). Then run your application using the "mpirun" script found in the PGI bin directory.
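
As a rough illustration only (the install prefix, library location, and input file name below are examples, so adjust them for your site):

# example CDK install prefix; put the PGI bin directory on your path
setenv PGI /opt/pgi601
set path = ( $PGI/linux86/6.0/bin $path )

# build against the MPICH libraries shipped with the 6.0 CDK
pghpf -o VacuumRespLYplans16t32 VacuumRespLY_plan.f \
      -L$PGI/linux86/6.0/lib -lfmpich -lmpich

# launch with the PGI-supplied mpirun rather than mpiexec
mpirun -np 2 ./VacuumRespLYplans16t32 < my_input.txt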

If this works, then you should just need to recompile your MPI Fortran interface. If it still fails, please send a report along with the code to trs@pgroup.com since it could be a compiler issue.

Thanks,
Mat