Open MPI network setup

I have three Jetson Nano 4GB A02 boards flashed with the 4.6.1 (2022/02/23)
SD card image. I set up static IP addresses and installed Open MPI; ssh, scp, and
ping all work between the systems. However, when I run the CUDA simpleMPI
example program, the second node (192.168.1.52) reports:

mpiexec --hostfile …/clusterfile ./simpleMPI
Running on 12 nodes

Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected. This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

Local host: nano2
Local PID: 6963
Peer hostname: nano2 ([[52067,1],4])
Source IP of socket: 192.168.1.52
Known IPs of peer:

/etc/hosts on that system:

127.0.0.1 localhost
127.0.1.1 nano2
192.168.1.52 nano2

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

192.168.1.51 nano1
192.168.1.53 nano3
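
For completeness, the clusterfile passed to mpiexec is just a plain Open MPI hostfile listing the three boards. A minimal sketch (the slot counts here are an assumption, chosen to match the 12 ranks reported above) would be:

nano1 slots=4
nano2 slots=4
nano3 slots=4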

Is there something I’m missing in the network setup?

It turns out that when ranks try to communicate with other ranks on the same node, they use the docker0 interface instead of eth0, and Open MPI rejects the connection because the source IP is not one it expects for that peer. I solved this by disabling Docker, and now simpleMPI runs on the multi-node Jetson Nano cluster:

$ sudo service docker stop      # stop the Docker service
$ sudo systemctl stop docker    # make sure the systemd unit is stopped as well
$ sudo ip link delete docker0   # remove the docker0 bridge interface
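
If you would rather keep Docker running, an alternative that should have the same effect (not tested on this cluster) is to restrict Open MPI's TCP transport to eth0 so docker0 is never considered:

$ mpiexec --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 \
    --hostfile …/clusterfile ./simpleMPI

Also note that stopping the service only lasts until the next reboot; sudo systemctl disable docker keeps it from starting again if you stay with the disable-Docker approach.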

Thanks for sharing!
