mpdboot waits forever

I’m sorry for flooding the forum with so many questions about MPICH2. We’ve just installed it on our system and it doesn’t seem to work properly.

I have configured .mpd.conf in my home folder. The mpd.hosts file contains the list of machines:

nfat.binf
nfkb.binf
ryr.binf
dhpr

The connection method is ssh. This is the script I use to test:

num=4
PGI=/opt/pgi
PGRSH=ssh


PATH=$bk_path:$PGI/linux86/10.5/bin:$PGI/linux86/10.5/mpi2/mpich/python32/bin/:$PGI/linux86/10.5/mpi2/mpich/bin
pgf90 -o mpihello_mpich2 -O2 -Mmpi=mpich2 mpihello.f
eval mpdboot -n $num -f mpd.hosts -m /opt/pgi/linux86/10.5/mpi2/mpich/bin/mpd -v --chkup
eval mpiexec -np $num ./mpihello_mpich2

mpdallexit
echo " – 64-bit"
PATH=$bk_path:$PGI/linux86-64/10.5/bin:$PGI/linux86-64/10.5/mpi2/mpich/bin/
pgf90 -o mpihello_mpich2 -O2 -Mmpi=mpich2 mpihello.f
eval mpdboot -r ssh -n $num -f mpd.hosts -m /opt/pgi/linux86-64/10.5/mpi2/mpich/bin/mpd -v --chkup -1
eval mpiexec -np $num /home/minhtuan/mpitest/mpihello_mpich2

mpdallexit

The result is as follows

– 32-bit
running mpdallexit on nfat.binf
LAUNCHED mpd on nfat.binf via
RUNNING: mpd on nfat.binf
LAUNCHED mpd on nfkb.binf via nfat.binf
LAUNCHED mpd on ryr.binf via nfat.binf
LAUNCHED mpd on dhpr.binf via nfat.binf
mpdboot_nfat.binf (handle_mpd_output 374): failed to ping mpd on ryr.binf; recvd output={}

mpiexec_nfat.binf cannot connect to local mpd (/tmp/mpd2.console_minhtuan); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.
mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_minhtuan); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
    mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.
– 64-bit
checking nfkb.binf
checking ryr.binf
checking dhpr.binf
there are 4 hosts up (counting local)
running mpdallexit on nfat.binf
LAUNCHED mpd on nfat.binf via
RUNNING: mpd on nfat.binf
LAUNCHED mpd on nfkb.binf via nfat.binf
LAUNCHED mpd on ryr.binf via nfat.binf
LAUNCHED mpd on dhpr.binf via nfat.binf


^Z
[2]+ Stopped ./tuan_script

In the 64-bit run it hangs at the last LAUNCHED line, and it cannot connect to the mpd on the remote machines. It seems that the remote mpds haven’t joined the ring yet. Could you please suggest a solution?

Thanks,
Tuan

Hi,

Before running a real application, I recommend testing whether the daemons start properly: start them with mpdboot, then call mpdtrace to see whether every host has started its daemon.

Here is an example:

h1% cat mpd.hosts
h2
h3

h1% mpdboot --totalnum=3

h1% mpdtrace
h1
h3
h2


Note that you don’t need to put the hostname of the machine where you invoke mpdboot in mpd.hosts. mpdboot automatically starts a daemon on the host where it is invoked.

Hongyon
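
The mpdboot/mpdtrace check above can be scripted. The following is only a sketch: the function name missing_from_ring and the idea of saving the mpdtrace output to a file are assumptions made for illustration, not part of MPICH2. It compares short hostnames (the part before the first dot), so nfkb.binf in mpd.hosts is treated as matching nfkb in the trace.

```shell
#!/bin/sh
# Hypothetical helper: print every host from a hosts file that does not
# appear in a saved mpdtrace listing.
#   $1 = hosts file (one host per line, optional ":ncpus" suffix)
#   $2 = file containing the output of mpdtrace
missing_from_ring() {
    while read -r host _; do
        # Skip blank lines and comment lines.
        case $host in ''|'#'*) continue ;; esac
        host=${host%%:*}      # drop an optional ":ncpus" suffix
        short=${host%%.*}     # compare on the short hostname only
        found=no
        while read -r traced _; do
            if [ "${traced%%.*}" = "$short" ]; then
                found=yes
            fi
        done < "$2"
        if [ "$found" = no ]; then
            echo "$host"
        fi
    done < "$1"
}

# Example use after a successful mpdboot (the path is an assumption):
#   mpdtrace > /tmp/ring.txt
#   missing_from_ring mpd.hosts /tmp/ring.txt
```

An empty result means every host listed in the file shows up in the ring. Any name printed is a host whose daemon did not join.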

Hi Hongyon,

with

%mpdboot
%mpdtrace

mpdtrace returns the hostname of the current node.

with

%mpdboot --totalnum=2

mpdboot waits forever and never returns. I don’t know what’s happening. If I add the -v option, it prints

running mpdallexit on nfat.binf
LAUNCHED mpd on nfat.binf via
RUNNING: mpd on nfat.binf
LAUNCHED mpd on nfkb.binf via nfat.binf

and hangs at this point.


Tuan

Did you run this in a script? Can you type it directly at the command prompt to see whether it also hangs? Just type it without eval and without the other arguments:

%mpdboot --totalnum=3

Make sure your mpd.hosts contains valid hostnames.

Also, ssh to all the hosts to confirm that you can. Do you need to enter a password when you ssh to those hosts? If so, you will also need to enter a password when you start the daemons. Perhaps it is not hanging but just waiting for a password.

You can also try:
%mpdboot --totalnum=3 --rsh=rsh

This uses rsh instead of ssh, provided you can rsh without a password.

Hongyon
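
One way to tell a real hang from a hidden password prompt is to run ssh in batch mode, where it fails instead of prompting. The sketch below assumes OpenSSH; the function name check_passwordless is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report, for each host given as an argument,
# whether a non-interactive ssh login succeeds.
# BatchMode=yes makes ssh fail rather than ask for a password;
# -n keeps ssh from reading stdin; ConnectTimeout bounds the wait.
check_passwordless() {
    status=0
    for host in "$@"; do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            status=1
        fi
    done
    return $status
}

# Example use with the hosts from this thread:
#   check_passwordless nfkb.binf ryr.binf dhpr
```

A FAIL line only says that the batch-mode login did not succeed. The cause could be a missing key, an unknown host key that ssh wants confirmed, or a name that does not resolve.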

Hi Hongyon,

I’m running from a terminal, and I ran the same command line you gave me.

Make sure your mpd.hosts contains valid hostnames.

Also, ssh to all the hosts to confirm that you can. Do you need to enter a password when you ssh to those hosts? If so, you will also need to enter a password when you start the daemons. Perhaps it is not hanging but just waiting for a password.

With NFAT as the master node, I generated a key using ssh-keygen and copied it to all the other hosts, so I can log in to them from NFAT without typing a password.

All machines

You can also try:
%mpdboot --totalnum=3 --rsh=rsh

This uses rsh instead of ssh, provided you can rsh without a password.

Hongyon

I can log in to all machines using either rsh or ssh. Using --rsh=rsh doesn’t give me a different result, though.

I tried -v -c to see whether there is any helpful message.

minhtuan@nfat:mpitest_jafri\ $mpdboot --totalnum=4 --rsh=ssh -v -c
checking nfkb.binf
checking ryr.binf
checking dhpr.binf
there are 4 hosts up (counting local)
running mpdallexit on nfat.binf
LAUNCHED mpd on nfat.binf via
RUNNING: mpd on nfat.binf
LAUNCHED mpd on nfkb.binf via nfat.binf
LAUNCHED mpd on ryr.binf via nfat.binf
LAUNCHED mpd on dhpr.binf via nfat.binf

and it doesn’t return to the prompt.

Thanks,
Tuan

I realized there were many mpd processes running. After killing them all on all machines, I reran the script
and got this error message.

$mpdboot -n 2
mpdboot_nfat.binf (handle_mpd_output 374): failed to ping mpd on nfkb.binf; recvd output={}

If I add the -v option:

%mpdboot -n 2 -v
running mpdallexit on nfat.binf
LAUNCHED mpd on nfat.binf via
RUNNING: mpd on nfat.binf
LAUNCHED mpd on nfkb.binf via nfat.binf
mpdboot_nfat.binf (handle_mpd_output 374): failed to ping mpd on nfkb.binf; recvd output={}

I hope this message tells you what the problem is.

Thanks,
Tuan

From what I have read about MPICH2, the network configuration, i.e. the /etc/hosts file, may be the issue. What OS are you running? From a web search, I found that this problem has been reported on CentOS (and maybe some other OSes too).


Anyhow, please try this:
Make sure to kill all daemons first.

  1. What are the names listed in mpd.hosts file?
  2. ping (and ssh) back and forth between those exact hostnames in your mpd.hosts and master node.
  3. Log in to each of those hosts and type hostname at the command prompt. The output should match the name listed in the mpd.hosts file. If not, you will need to reconfigure so that they match.



Hongyon
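
Step 3 of the checklist above can be automated along these lines. This is a sketch under assumptions: passwordless ssh is already working, and the function name check_hostnames is invented for illustration. Note that the mpd.hosts shown at the top of the thread lists dhpr while the mpdboot output refers to dhpr.binf; whether that difference matters here is not established in the thread, but it is the kind of mismatch this check would flag.

```shell
#!/bin/sh
# Hypothetical helper: for each name as listed in mpd.hosts, ask the
# remote machine for its own hostname and compare the two strings.
check_hostnames() {
    status=0
    for listed in "$@"; do
        # If ssh itself fails, record that instead of a hostname.
        reported=$(ssh -n -o BatchMode=yes "$listed" hostname 2>/dev/null) || reported='(ssh failed)'
        if [ "$reported" = "$listed" ]; then
            echo "match    $listed"
        else
            echo "MISMATCH $listed reports $reported"
            status=1
        fi
    done
    return $status
}

# Example use with the names exactly as they appear in mpd.hosts:
#   check_hostnames nfkb.binf ryr.binf dhpr
```

The comparison is deliberately a strict string match, because the checklist asks for the names to be the same as listed.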

Hi Hongyon,
I’ve just sent an email to trs@pgroup.com and asked them to forward my /etc/hosts files (from both the master node and the slave nodes) to you and Mat. I hope you can figure something out by looking at them.
I hope to receive good news from you soon.

Thanks,
Tuan.