Problem with MPI communication

I compiled an MPI code and tried to run it. It failed as follows:

$ mpirun -np 2 ./a.out
ssh_exchange_identification: Connection closed by remote host
p0_24123:  p4_error: Child process exited while making connection to remote process on ode4: 0
p0_24123: (7.015625) net_send: could not write to fd=4, errno = 32

MPI was working on this machine recently, but there have been a couple of reboots and a re-install of PGI, so I guess something has changed. I ran into this in the distant past and found a solution, but heck if I can remember, or find, what that solution was. Anybody have any ideas? Thanks!

If anybody cares about this, I finally got it. I had to edit
/opt/pgi/linux86-64/12.3/mpi/mpich/share/machines.LINUX and add the IP address (not the hostname, but the numerical IP address) of my machine, once for each processor. I remember this now from some time ago, and I think it points to some kind of problem with the system, so not everyone is going to run into this. But if you do and nothing else works, give it a try! Also, you need to similarly edit
/opt/pgi/linux86/12.3/mpi/mpich/share/machines.LINUX
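
For anyone trying the same fix, here is a rough sketch of what the edited machines.LINUX might look like for a 2-process run on a single machine. The address 192.168.1.10 is only a placeholder (it is not from my system); substitute the actual numerical IP address of your machine, listed once per processor:

```text
# machines.LINUX: one entry per processor
# 192.168.1.10 is a made-up example address, replace with your own
192.168.1.10
192.168.1.10
```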

This suggests that the node cannot resolve host names to IP addresses. Other solutions are to add the host name and IP address to the "/etc/hosts" file, or to set up a DNS or DHCP server.
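
One way to check whether name resolution is the culprit is to ask the system resolver directly. The snippet below is a minimal sketch, assuming a Linux system where the standard getent utility is available; the function name check_host is made up for this example:

```shell
# check_host is a hypothetical helper name, not part of MPICH or PGI.
# It asks the system resolver (/etc/hosts, DNS, ...) for a host name
# and prints the matching entry if one is found.
check_host() {
    if getent hosts "$1"; then
        echo "$1 resolves"
    else
        echo "$1 does NOT resolve"
        return 1
    fi
}

# Check this machine's own node name, as reported by uname -n.
check_host "$(uname -n)" || echo "consider adding an entry to /etc/hosts"
```

If the name does not resolve, an /etc/hosts entry is a single line with the IP address followed by the host name, for example "192.168.1.10  ode4", where ode4 is the host from the error message above and the address is again just a placeholder.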

  • Mat