I have a 21-node HPC cluster with InfiniBand as the primary interconnect. All of the nodes are connected to a single Mellanox SX6036 switch via DAC cables, and all but one of the compute nodes are able to fully connect. The nodes all run CentOS 7, and the compute nodes boot via PXE using Bright Cluster Manager 8. All compute nodes are built by Supermicro and have built-in ConnectX-3 interfaces. The subnet manager (SM) runs on the switch, and the network contains two partitions: the default, set to 10 Gbps as a failover, and one for 56 Gbps FDR.
On the problem node I have already tried a different switch port, a new cable, and a new NIC. None of these changed the behavior.
The issue was first noticed when a mount that uses the InfiniBand interface failed. Checking the output of ip addr gave this:
4: ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state DOWN group default qlen 256
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:f4:52:14:03:00:f6:7c:41 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.33.5.12/22 brd 10.33.7.255 scope global ib0
       valid_lft forever preferred_lft forever
Checking the dmesg log, I found the following, with the last line repeated multiple times:
[ 190.555962] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[ 190.556807] <mlx4_ib> mlx4_ib_add: counter index 0 for port 1 allocated 0
[ 190.609814] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
[ 191.728874] ib0: enabling connected mode will cause multicast packet drops
[ 191.728940] ib0: mtu > 4092 will cause multicast packet drops.
[ 191.750468] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[ 236.874404] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22
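For anyone reading the failed join in the dmesg output, here is a minimal Python sketch that splits the multicast GID from that line into its fields. It assumes the IPoIB multicast GID layout described in RFC 4391 (0xff, flags and scope, a 16-bit signature, the 16-bit P_Key, then the group ID). The function name decode_ipoib_mgid is hypothetical, and treating status -22 as a negated Linux errno is also an assumption on my part.

```python
import errno
import os


def decode_ipoib_mgid(mgid: str) -> dict:
    """Split an IPoIB multicast GID into fields (layout assumed from RFC 4391)."""
    groups = mgid.lower().split(":")
    if len(groups) != 8 or any(len(g) != 4 for g in groups):
        raise ValueError(f"expected 8 groups of 4 hex digits, got {mgid!r}")
    raw = bytes.fromhex("".join(groups))
    if raw[0] != 0xFF:
        raise ValueError("not a multicast GID (first byte must be 0xff)")
    signature = int.from_bytes(raw[2:4], "big")
    return {
        "flags": raw[1] >> 4,          # high nibble of the second byte
        "scope": raw[1] & 0x0F,        # low nibble of the second byte
        "protocol": {0x401B: "IPv4", 0x601B: "IPv6"}.get(signature, "unknown"),
        "pkey": int.from_bytes(raw[4:6], "big"),
        # Assumed broadcast pattern: zero padding followed by 32 one-bits.
        "is_broadcast": raw[6:12] == bytes(6) and raw[12:16] == b"\xff" * 4,
    }


# GID and status code copied from the dmesg line in this post.
fields = decode_ipoib_mgid("ff12:401b:ffff:0000:0000:0000:ffff:ffff")
print({k: (hex(v) if k == "pkey" else v) for k, v in fields.items()})
print(errno.errorcode[22], "-", os.strerror(22))
```

On this input the P_Key field comes out as 0xffff, which I take to be the default partition, and errno 22 is EINVAL.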
Here is the ibstatus output from the node.
Infiniband device 'mlx4_0' port 1 status:
    default gid:     fe80:0000:0000:0000:f452:1403:00f6:7c41
    base lid:        0x16
    sm lid:          0x17
    state:           4: ACTIVE
    phys state:      5: LinkUp
    rate:            56 Gb/sec (4X FDR)
    link_layer:      InfiniBand
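As a sanity check that ib0 and the ibstatus port are the same thing, here is a small Python sketch that splits the link/infiniband address from ip addr into its queue pair number and port GID. It assumes the 20-byte IPoIB hardware address layout from RFC 4391 (one flags byte, a 3-byte QPN, then the 16-byte GID). The function name split_ipoib_hwaddr is hypothetical.

```python
from typing import Tuple


def split_ipoib_hwaddr(hwaddr: str) -> Tuple[int, str]:
    """Split a 20-byte IPoIB hardware address into (QPN, GID string)."""
    raw = bytes.fromhex(hwaddr.replace(":", ""))
    if len(raw) != 20:
        raise ValueError(f"expected 20 bytes, got {len(raw)}")
    # Byte 0 carries flags; bytes 1-3 are the queue pair number.
    qpn = int.from_bytes(raw[1:4], "big")
    # The remaining 16 bytes are the port GID, printed in 4-digit groups.
    gid_hex = raw[4:].hex()
    gid = ":".join(gid_hex[i:i + 4] for i in range(0, 32, 4))
    return qpn, gid


# Values copied from the ip addr and ibstatus output in this post.
qpn, gid = split_ipoib_hwaddr(
    "80:00:02:08:fe:80:00:00:00:00:00:00:f4:52:14:03:00:f6:7c:41"
)
ibstatus_gid = "fe80:0000:0000:0000:f452:1403:00f6:7c41"
print(hex(qpn), gid, gid == ibstatus_gid)
```

The GID embedded in the ib0 hardware address matches the default gid reported by ibstatus, so both outputs describe the same port.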
If anyone has any ideas I would be grateful. Also let me know if any more info is needed.