MPI problem with PGI CDK in cluster environment

Hi everybody,

I’ve set up a cluster with 8 nodes. The first is the master node and the other 7 are slaves. All of them have identical hardware (except the hard disk, since the master node serves the filesystem to the others), and all of them run Ubuntu 8.04. Node names are signum0x, with x from 1 to 8. Using ssh with public/private key pairs, I can log in and run commands on every node without a password.
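A loop like the following can confirm the password-less setup on every node. This is only a sketch: the function names `node_names` and `check_nodes` are made up for illustration, the host names follow the signum0x pattern described above, and `BatchMode=yes` makes ssh fail instead of prompting when key authentication does not work.

```shell
# Print the eight node names, signum01 .. signum08 (naming taken from the post).
node_names() {
  i=1
  while [ "$i" -le 8 ]; do
    printf 'signum0%d\n' "$i"
    i=$((i + 1))
  done
}

# Try a trivial remote command on each node without allowing a password prompt.
check_nodes() {
  for node in $(node_names); do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true 2>/dev/null; then
      echo "$node: ok"
    else
      echo "$node: FAILED"
    fi
  done
}
```

Running `check_nodes` from the master node should then print one line per node. Any node reported as FAILED is worth fixing before looking at MPI itself.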

After solving a couple of problems related to library versions, I got PGI CDK 5.2 running in a shared resource that every node is able to access.

But after compiling the example program:
pgf77 -o mpihello mpihello.f

If I try to execute it:
mpi -np 4 mpihello

I get this response:
rm_4916: p4_error: rm_start: net_conn_to_listener failed: 39633
p0_13060: p4_error: Child process exited while making connection to remote process on signum02: 0

Has anyone seen this problem before? I think it is related to an incorrect installation on the other nodes, but I cannot find what I did wrong. Is there any diagnostic tool I could use to better understand what’s going on?

Thanks in advance…

Jose,

Sorry you are having problems. Release 5.2 is not supported on Ubuntu 8.04,
so we are wondering if the installation even completed correctly.

The first thing to do is to determine if the example will run multi-process
on the master node alone.

Use this version of mpihello.f

% more mpihello.f
      program hello
      include 'mpif.h'
      integer ierr, myproc,hostnm
      character*64 hostname
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, myproc, ierr)
      ierr=setvbuf3f(6,2,0)
      ierr=hostnm(hostname)
      write(6,100) myproc,hostname
 100  format(1x,"hello - I am process",i3," host ",A32)
      call mpi_finalize(ierr)
      end

pgf77 -o mpihello mpihello.f -lfmpich -lmpich

Create a machines.LINUX file with only the
master node in it, and make sure you are running the
‘mpirun’ version from the MPICH directory installed
with the CDK.

mpirun -np 4 mpihello

and it should return the hostname of the master node.
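The single-node check could be scripted roughly as follows. This is a sketch under assumptions, not the documented CDK procedure: it writes the machines file into the current directory and passes it with `-machinefile`, which assumes the MPICH `mpirun` shipped with the CDK accepts that option. The default location of machines.LINUX depends on the installation.

```shell
# Machines file listing only this host (the master node).
uname -n > machines.LINUX
cat machines.LINUX

# Show which mpirun comes first on PATH; it should be the one under the
# CDK's MPICH directory (that directory depends on where the CDK was installed).
command -v mpirun || echo "mpirun not found on PATH"

# Run only if both mpirun and the compiled example are present.
# Assumption: this mpirun accepts -machinefile.
if command -v mpirun >/dev/null 2>&1 && [ -x ./mpihello ]; then
  mpirun -np 4 -machinefile machines.LINUX ./mpihello
fi
```

If the path printed for mpirun is not inside the CDK tree, a different MPI installation is being picked up first, and that alone can explain connection errors.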

If it works on one machine but not two, I would look again at the
installation.

dave

Hello,
I tried to compile and run mpihello.f using these commands:
pgf77 -o mpihello mpihello.f -lfmpich -lmpich
and
mpirun -np 4 mpihello

I get an error like this:
problem with execution of mpihello on pc-asa: [Errno 2] No such file or directory
problem with execution of mpihello on pc-asa: [Errno 2] No such file or directory
problem with execution of mpihello on pc-asa: [Errno 2] No such file or directory
problem with execution of mpihello on pc-asa: [Errno 2] No such file or directory

Any help?
Thanks

Hi meteocat10,

If “pc-asa” is a remote system, check that you are running in a shared directory that it can also see. Also, you may need to run mpihello using its fully qualified path.

If “pc-asa” is your local system, then check that “.” is in your environment’s PATH variable, or run using “./mpihello” or the fully qualified path.
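Both checks can be done from the shell before calling mpirun. The helper name `dot_on_path` below is hypothetical and the snippet is only a sketch. It treats an empty PATH entry the same as “.”, since shells search the current directory for empty entries.

```shell
# Succeeds if the given PATH-style string includes the current directory,
# either as "." or as an empty entry.
dot_on_path() {
  case ":$1:" in
    *:.:*|*::*) return 0 ;;
    *)          return 1 ;;
  esac
}

if dot_on_path "$PATH"; then
  echo "current directory is on PATH"
else
  echo "current directory is not on PATH; use ./mpihello or the full path"
fi

# Confirm the executable exists where mpirun will be started.
[ -x ./mpihello ] || echo "no executable ./mpihello in $(pwd)"
```

If the second message appears, mpirun is being started from a directory that does not contain the compiled program, which matches the “No such file or directory” error.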

Hope this helps,
Mat