– 32-bit
running mpdallexit on nfat.binf
LAUNCHED mpd on nfat.binf via
RUNNING: mpd on nfat.binf
LAUNCHED mpd on nfkb.binf via nfat.binf
LAUNCHED mpd on ryr.binf via nfat.binf
LAUNCHED mpd on dhpr.binf via nfat.binf
mpdboot_nfat.binf (handle_mpd_output 374): failed to ping mpd on ryr.binf; recvd output={}
mpiexec_nfat.binf cannot connect to local mpd (/tmp/mpd2.console_minhtuan); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a “console” (-n option)
In case 1, you can start an mpd on this host with:
mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.
mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_minhtuan); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a “console” (-n option)
In case 1, you can start an mpd on this host with:
mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.
– 64-bit
checking nfkb.binf
checking ryr.binf
checking dhpr.binf
there are 4 hosts up (counting local)
running mpdallexit on nfat.binf
LAUNCHED mpd on nfat.binf via
RUNNING: mpd on nfat.binf
LAUNCHED mpd on nfkb.binf via nfat.binf
LAUNCHED mpd on ryr.binf via nfat.binf
LAUNCHED mpd on dhpr.binf via nfat.binf
^Z
[2]+ Stopped ./tuan_script
It hangs at the last LAUNCHED line and cannot connect to the mpd on the remote machines. It seems that the remote mpds haven’t joined the ring yet. Could you please suggest a solution?
Before running a real application, I recommend testing whether the daemons start properly: start them with mpdboot, then call mpdtrace to see whether every host has started its daemon.
Here is an example:
h1% cat mpd.hosts
h2
h3
h1% mpdboot --totalnum=3
h1% mpdtrace
h1
h3
h2
Note that you don’t need to list the host where you invoke mpdboot in mpd.hosts. mpdboot automatically starts a daemon on the host where you invoke it.
Did you run this in a script? Can you run it by typing at the command prompt to see whether it also hangs? Just type the command without eval or its arguments:
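The check above can be scripted. This is only a minimal sketch: ring_has_hosts is a hypothetical helper name, not part of MPICH2, and it simply counts the non-empty lines that mpdtrace prints.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the mpdtrace output on stdin lists
# exactly $1 hosts (one per line, blank lines ignored).
ring_has_hosts() {
    expected=$1
    actual=$(grep -c . || true)   # grep -c exits non-zero when the count is 0
    [ "$actual" -eq "$expected" ]
}

# Usage sketch (needs the MPICH2 mpd tools and an mpd.hosts file):
#   mpdboot --totalnum=3
#   mpdtrace | ring_has_hosts 3 && echo "ring is complete"
#   mpdallexit
```

If the count is smaller than expected, at least one daemon did not join the ring.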
%mpdboot --totalnum=3
Make sure your mpd.hosts contains valid hostnames.
Also, ssh to all of the hosts to see whether you can log in. Do you need to enter a password when you ssh to those hosts? If so, you will need to enter the password when you start the daemons. Perhaps it is not hanging but just waiting for a password.
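One way to test this without sitting at a password prompt is to run ssh in batch mode, where it fails instead of asking. The sketch below is an illustration only: list_hosts and check_passwordless are hypothetical helper names, and the parsing assumes one host per line in mpd.hosts, optionally followed by a ":ncpus" suffix or a '#' comment.

```shell
#!/bin/sh
# Hypothetical helper: print the host names from an mpd.hosts-style file,
# skipping blank lines and '#' comments and dropping any ":ncpus" suffix.
list_hosts() {
    sed -e 's/#.*//' -e 's/^[[:space:]]*//' -e 's/[[:space:]].*//' \
        -e 's/:.*//' -e '/^$/d' "$1"
}

# Report, for each host, whether ssh logs in without prompting.
# BatchMode=yes makes ssh fail rather than ask for a password;
# -n keeps ssh from swallowing the loop's input.
check_passwordless() {
    list_hosts "$1" | while read -r host; do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: needs a password or is unreachable"
        fi
    done
}

# Usage: check_passwordless mpd.hosts
```

Any host that is not reported as ok would make mpdboot wait for input or fail.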
You can also try:
%mpdboot --totalnum=3 --rsh=rsh
This makes mpdboot use rsh instead of ssh, which helps if you can rsh without a password.
I’m running from a terminal. I ran the same command line you gave me.
Make sure your mpd.hosts contains valid hostnames.
Also, ssh to all of the hosts to see whether you can log in. Do you need to enter a password when you ssh to those hosts? If so, you will need to enter the password when you start the daemons. Perhaps it is not hanging but just waiting for a password.
Suppose NFAT is the master node. I generated a key using ssh-keygen and copied it to all the other hosts, so that I can log in to them from NFAT without typing a password.
All machines
You can also try:
%mpdboot --totalnum=3 --rsh=rsh
This makes mpdboot use rsh instead of ssh, which helps if you can rsh without a password.
Hongyon
I can log in to all machines using either rsh or ssh. Using --rsh=rsh doesn’t give me a different result, though.
I tried -v -c to see whether there is any helpful message.
minhtuan@nfat:mpitest_jafri\ $mpdboot --totalnum=4 --rsh=ssh -v -c
checking nfkb.binf
checking ryr.binf
checking dhpr.binf
there are 4 hosts up (counting local)
running mpdallexit on nfat.binf
LAUNCHED mpd on nfat.binf via
RUNNING: mpd on nfat.binf
LAUNCHED mpd on nfkb.binf via nfat.binf
LAUNCHED mpd on ryr.binf via nfat.binf
LAUNCHED mpd on dhpr.binf via nfat.binf
I realized there were many mpd processes running. After killing them all on all machines, I reran the script and got this error message.
$mpdboot -n 2
mpdboot_nfat.binf (handle_mpd_output 374): failed to ping mpd on nfkb.binf; recvd output={}
If I add the -v option:
%mpdboot -n 2 -v
running mpdallexit on nfat.binf
LAUNCHED mpd on nfat.binf via
RUNNING: mpd on nfat.binf
LAUNCHED mpd on nfkb.binf via nfat.binf
mpdboot_nfat.binf (handle_mpd_output 374): failed to ping mpd on nfkb.binf; recvd output={}
I hope with this message, you will know what the problem is.
From what I have read about MPICH2, the network configuration, i.e., the /etc/hosts file, may be the issue. What OS are you running? From searching the web, I found that the problem can occur on CentOS (and maybe some other OSes too).
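A quick way to look at the /etc/hosts question on each machine is to see what address the machine's own name resolves to. The sketch below assumes, and this is not confirmed anywhere in this thread, that the trouble is the host name being mapped to a loopback address, which is a commonly reported cause of mpd ring failures. is_loopback and check_own_name are hypothetical helper names, and only IPv4 is handled.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the IPv4 address is in 127.0.0.0/8.
is_loopback() {
    case "$1" in
        127.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Resolve this machine's own host name through the system resolver
# (getent normally consults /etc/hosts) and report what it maps to.
check_own_name() {
    name=$(hostname)
    addr=$(getent hosts "$name" | awk 'NR==1 {print $1}')
    if [ -z "$addr" ]; then
        echo "$name does not resolve"
    elif is_loopback "$addr"; then
        echo "$name resolves to loopback address $addr"
    else
        echo "$name resolves to $addr"
    fi
}

# Usage: run check_own_name on the master node and on every remote host.
```

If a host reports a loopback address for its own name, the other daemons may be unable to connect back to it.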
Anyhow, please try this:
Make sure to kill all daemons first.
What are the names listed in the mpd.hosts file?
Ping (and ssh) back and forth between the master node and those exact hostnames in your mpd.hosts.
Log in to each of those hosts and type hostname at the command prompt. The output should match the name listed in the mpd.hosts file. If not, you will need to reconfigure so that they match.
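The hostname comparison in the last step can be done in one pass from the master node. This is a rough sketch that assumes passwordless ssh already works and that mpd.hosts holds plain host names, one per line; check_hostnames is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: ask each listed host for its own name and flag
# any host whose answer differs from the name used in the hosts file.
check_hostnames() {
    while read -r listed; do
        [ -n "$listed" ] || continue
        # -n keeps ssh from reading the hosts file on stdin
        reported=$(ssh -n "$listed" hostname 2>/dev/null) || reported=""
        if [ "$reported" = "$listed" ]; then
            echo "$listed: ok"
        else
            echo "$listed: reports '${reported:-no answer}'"
        fi
    done < "$1"
}

# Usage: check_hostnames mpd.hosts
```

Any line that is not marked ok points at a name that needs to be fixed in mpd.hosts or in the host's configuration.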
Hi Hongyon,
I’ve just sent an email to trs@pgroup.com and asked them to forward my /etc/hosts files (from both the master node and the slave nodes) to you and Mat. I hope you can figure something out by looking at them.
I hope to receive good news from you soon.