I’m using PGI 10.4 with MPICH to build the WRF model on a 64-node Linux cluster (CentOS 5.2, 32-bit). When I run the model, I get the following error:
mpirun -np 4 -machinefile ./real.hosts ./real.exe
rm_7555: p4_error: rm_start: net_conn_to_listener failed: 42887
p0_17719: p4_error: Child process exited while making connection to remote process on troy-myr02: 0
Killed by signal 2.
Killed by signal 2.
p0_17719: (25.562833) net_send: could not write to fd=4, errno = 32
I’ve built the mpihello executable mentioned in this post (http://www.pgroup.com/userforum/viewtopic.php?p=7484&sid=f69e37c18ff04b0a64a314f55e164919) as a simple test. If I restrict the host names in the machinefile to the local machine, it runs properly. If I try to span the run across other hosts, I get the following message:
mpirun -np 4 -machinefile /home/wrfuser/hosts.test /home/wrfuser/mpihello
rm_7516: p4_error: rm_start: net_conn_to_listener failed: 52608
p0_17408: p4_error: Child process exited while making connection to remote process on troy-g02: 0
Killed by signal 2.
Killed by signal 2.
p0_17408: (25.507350) net_send: could not write to fd=4, errno = 32
I can ssh to all of the nodes without a password and execute commands. For example,
ssh troy-g02 uptime
works, and no unexpected text is printed when logging in remotely. I’m at a loss as to how to get things running. Any help would be appreciated.
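To rule out a single misbehaving node, the same ssh check can be repeated for every host in the machinefile. This is a minimal sketch, assuming the machinefile has one host per line, optionally followed by a :n CPU count, with # starting a comment. list_hosts and check_hosts are hypothetical helper names, not part of MPICH:

```sh
#!/bin/sh
# Print the unique host names found in a machinefile.
# Assumption: one host per line, optional ":n" suffix, "#" starts a comment.
list_hosts() {
    sed -e 's/#.*//' -e 's/:.*//' "$1" | awk 'NF { print $1 }' | sort -u
}

# Run the passwordless "ssh <host> uptime" check on every host in the file.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_hosts() {
    for host in $(list_hosts "$1"); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" uptime >/dev/null 2>&1; then
            echo "ok     $host"
        else
            echo "FAILED $host"
        fi
    done
}
```

After sourcing the file, it could be run as check_hosts /home/wrfuser/hosts.test. It only repeats the ssh test described above; it does not exercise the connections that mpirun itself opens between the nodes.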