mpich problem

I’m using PGI 10.4 with MPICH to build the WRF model on a 64-node Linux cluster (CentOS 5.2, 32-bit). When I run the model I get the following error:

>>mpirun -np 4 -machinefile ./real.hosts ./real.exe
rm_7555:  p4_error: rm_start: net_conn_to_listener failed: 42887
p0_17719:  p4_error: Child process exited while making connection to remote process on troy-myr02: 0
Killed by signal 2.
Killed by signal 2.
p0_17719: (25.562833) net_send: could not write to fd=4, errno = 32

I’ve built the mpihello executable mentioned in this post (http://www.pgroup.com/userforum/viewtopic.php?p=7484&sid=f69e37c18ff04b0a64a314f55e164919) as a simple test. If I restrict the host names in the machinefile to the local machine, it runs properly. If I attempt to span the run across other hosts, I get the following message:

mpirun -np 4 -machinefile /home/wrfuser/hosts.test /home/wrfuser/mpihello
rm_7516:  p4_error: rm_start: net_conn_to_listener failed: 52608
p0_17408:  p4_error: Child process exited while making connection to remote process on troy-g02: 0
Killed by signal 2.
Killed by signal 2.
p0_17408: (25.507350) net_send: could not write to fd=4, errno = 32

I’m able to ssh to all of the nodes without a password and execute commands (e.g.

ssh troy-g02 uptime

works). There is no unexpected text when logging in remotely. I’m at a loss as to how to get things running. Any help would be appreciated.

Hi tuckerm,

My best guess is that your MPICH install is configured to use rsh and your cluster isn’t set up to use rsh.
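
One way to test that guess is to try both remote shells against every host in the machinefile. The following is only a rough sketch: machinefile_hosts and check_remote_shell are hypothetical helper names (not part of MPICH), and it assumes the machinefile has one "host" or "host:nprocs" entry per line with optional '#' comments. A blocked rsh port may cause a call to hang until it times out.

```shell
#!/bin/sh
# Print the unique host names from a machinefile.
# Assumption: one "host" or "host:nprocs" entry per line, '#' starts a comment.
machinefile_hosts() {
    sed -e 's/#.*//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$1" |
        awk -F: 'NF && $1 != "" && !seen[$1]++ { print $1 }'
}

# Run a trivial command on every host through the given remote shell
# (e.g. rsh or ssh) and report whether it succeeded.
# Hypothetical helper for illustration only.
check_remote_shell() {
    cmd=$1
    file=$2
    for host in $(machinefile_hosts "$file"); do
        if "$cmd" "$host" true </dev/null >/dev/null 2>&1; then
            echo "$cmd $host: ok"
        else
            echo "$cmd $host: FAILED"
        fi
    done
}

# Example, using the machinefile from the post above:
#   check_remote_shell rsh /home/wrfuser/hosts.test
#   check_remote_shell ssh /home/wrfuser/hosts.test
```

If rsh fails where ssh succeeds, that would support the guess. As far as I recall, the ch_p4 device in MPICH 1.x can be pointed at ssh either at run time through the P4_RSHCOMMAND environment variable or at build time with the -rsh=ssh configure option, but please verify that against the documentation for your MPICH version.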

  • Mat