Dear NVIDIA developers and users,
we have a small InfiniBand network consisting of three unmanaged MQM8790 switches and a dozen nodes.
The (simplified) topology is shown in the attached picture.
The three switches are set up in a ring topology (allowed according to the manual), interconnected with 5x HDR200 links.
In total there are ten nodes (OS[1…8] and MDS1/2), each equipped with two ConnectX-6 dual-port NICs. XFER1/2 also have dual-port cards but use only one InfiniBand link each.
In addition, there is a 100 Gbit/s Ethernet network built from SN2700 switches.
For maximum redundancy, we decided to split the ports of the ConnectX-6 cards so that each card has one Ethernet and one InfiniBand port. The two IB ports of each node are connected to two different switches.
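For reference, the per-card port split was done with `mlxconfig`, roughly like this (the MST device path and which port gets which personality are just examples):

```
# Sketch of the per-NIC port split; device path and port assignment are examples
mst start
# LINK_TYPE_P*: 1 = InfiniBand, 2 = Ethernet
mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=1
# The new port types only take effect after a reboot / power cycle
```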
As the switches are unmanaged, an OpenSM instance runs on node XFER1 (XFER2 serves as backup SM).
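The master/standby roles are determined by the SM priority; a minimal sketch of the configuration (the priority values here are only illustrative):

```
# /etc/opensm/opensm.conf on XFER1 (intended master) -- higher sm_priority wins the election
sm_priority 15

# /etc/opensm/opensm.conf on XFER2 (standby SM)
sm_priority 10
```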
This all works well on the IB link layer: all links are up according to `iblinkinfo`, `ibqueryerror` finds no errors, and with `ibping` we can ping on the IB layer.

The trouble starts when we configure IP addresses on top of the IB links (*). The IB interfaces get the addresses 172.3.201.1/16 on ib0 and 172.3.201.11/16 on ib1 of OS1, 172.3.201.2/16 on ib0 and 172.3.201.12/16 on ib1 of OS2, and so on, i.e. all interfaces are in the same /16 subnet (**). Initially this also works well and we can ping over the IPoIB network, but once the network has settled for some time we start to observe seemingly random ping losses, as if the connection were broken. Sometimes this affects only one of the interfaces of a node, sometimes both. If one keeps pinging for a few seconds, the connection suddenly comes back, as if the route had been revived.
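To make the addressing scheme concrete, this is what it boils down to on OS1, sketched with plain `ip` commands for illustration:

```
# Addressing on OS1, for illustration (the two IB ports go to different switches)
ip addr add 172.3.201.1/16  dev ib0
ip addr add 172.3.201.11/16 dev ib1
ip link set ib0 up
ip link set ib1 up

# OS2 gets 172.3.201.2/16 on ib0 and 172.3.201.12/16 on ib1, and so on.
# After some time, pings such as the following start failing intermittently:
ping -I ib0 172.3.201.2
```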
There are no IB-layer errors, no conclusive log entries on the nodes, and nothing in the SM log that points to a problem.
Firmware on the NICs and the switches is up to date. OS is Rocky Linux 8.10.
The problem seems to be independent of the driver (Mellanox core drivers shipped with Rocky Linux, MLNX-OFED 23.10 or DOCA Networking).
We are grateful for any hint on how to diagnose and resolve this issue.
(*) You may ask why we are configuring IPoIB at all and why unstable IPoIB is a problem: we plan to deploy a Lustre file system on these nodes, and Lustre initially needs IP connectivity even on the o2ib interfaces, even though it later switches to RDMA. With unreliable IPoIB connections, Lustre often simply fails to bring up LNet and dies.
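For illustration, the LNet setup we are aiming at maps both IPoIB interfaces to o2ib networks, roughly like this (network names and interface mapping are examples, not our final tuning):

```
# /etc/modprobe.d/lustre.conf (sketch): one o2ib LNet network per IPoIB interface
options lnet networks="o2ib0(ib0),o2ib1(ib1)"
```

The o2ib NIDs are derived from the IPoIB addresses (e.g. 172.3.201.1@o2ib0), which is why the unstable IPoIB links hit LNet startup directly.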
(**) The 100G Ethernet network uses a completely disjoint /16 subnet, of course.