I’m hoping that someone has some useful advice to help me fix an issue in my environment.
We have a Mellanox HA fabric made up of 4 SB7700 switches (ib1 - ib4). Something seems to have happened to stop failover somewhere along the line. Our ib3 switch is the SM-HA master with the other switches running in standby. However, our master switch is not actually running the SM. It’s enabled, but in the “stopped” state. I don’t see a way to force it to restart or to force another switch to become master.
Also, I’m a little nervous about rebooting ib3 in the hope that another switch becomes master. We’re using this IB fabric as the backbone for a GPFS system. One of the quorum servers already dropped out of the network because its port appears to have died on ib3, but we can’t check the ports with the SM not running (and new connections to the switch aren’t recognized or initiated unless the SM on the switch is active). Also, without the SM running on the master switch, it appears that the servers still in the GPFS cluster no longer see the other switches. I say that because running “ibswitches” on any of the nodes still in the cluster only reports the ib3 switch, whereas the rebooted quorum node that lost its connection to ib3 sees all the other switches. That server can’t connect to any other node over the IB network, though, since the others are only paying attention to ib3 right now.
Is there any way to force a change in the SM master (without rebooting the master switch) or to try to restart the SM on ib3? SM is enabled, just not running. With the risk of losing quorum in my GPFS cluster if ib3 goes down without the servers seeing another switch, I’m worried that there’s no way forward without planning a major outage, stopping GPFS everywhere, rebooting ib3, and just hoping I’ll be able to restart GPFS afterwards. That’s not a prospect I’d like to have to plan for.
So, any thoughts or advice for how to get my fabric out of this situation would be GREATLY appreciated.
Please note that in the “show ib smnodes” long output, the entry for cab-ib3 shows “SM Running : stopped” while the others all show “SM Running : running” as expected.
Let me try that again in pre-formatted text, since the system interpreted the underlines as bold formatting for the lines above:
cab-ib3 [ab-ib-cluster: master] # show ib smnodes brief
HA state of switch infiniband-default
========================================
IB Subnet HA name: ab-ib-cluster
HA IP address: 172.20.0.107/24
Active HA nodes: 4
ID        Local node  SM-HA state  IP             SM        Priority
--------------------------------------------------------------------
cab-ib3   *           master       172.20.0.105   enabled   10
cab-ib1               standby      172.20.0.103   enabled   7
cab-ib2               standby      172.20.0.104   enabled   1
cab-ib4               standby      172.20.0.106   enabled   4
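
In case it helps anyone reading along, here is a rough Python sketch for turning the brief table above into something a script can check. It only parses text that has already been captured from “show ib smnodes brief”; it does not talk to the switch. The names HaNode, parse_smnodes_brief and SAMPLE are made up for illustration, and the row layout is assumed from the output pasted above, so other MLNX-OS releases may format it differently. The brief output doesn’t include the “SM Running” field, so this can’t detect the stopped SM on its own.

```python
import re
from dataclasses import dataclass

# Hypothetical helper types and names; nothing here is a Mellanox API.
@dataclass
class HaNode:
    name: str
    is_local: bool   # True for the row marked with "*"
    ha_state: str    # e.g. "master" or "standby"
    ip: str
    sm: str          # e.g. "enabled"
    priority: int

# Assumed row layout: ID, optional "*", SM-HA state, IP, SM, priority.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<local>\*\s+)?(?P<state>\S+)\s+"
    r"(?P<ip>\d+\.\d+\.\d+\.\d+)\s+(?P<sm>\S+)\s+(?P<prio>\d+)$"
)

def parse_smnodes_brief(text):
    """Return one HaNode per table row; non-row lines are skipped."""
    nodes = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            nodes.append(HaNode(
                name=m.group("name"),
                is_local=m.group("local") is not None,
                ha_state=m.group("state"),
                ip=m.group("ip"),
                sm=m.group("sm"),
                priority=int(m.group("prio")),
            ))
    return nodes

# Rows copied from the output in this post.
SAMPLE = """\
Active HA nodes: 4
ID Local node SM-HA state IP SM Priority
cab-ib3 * master 172.20.0.105 enabled 10
cab-ib1 standby 172.20.0.103 enabled 7
cab-ib2 standby 172.20.0.104 enabled 1
cab-ib4 standby 172.20.0.106 enabled 4
"""

if __name__ == "__main__":
    nodes = parse_smnodes_brief(SAMPLE)
    masters = [n for n in nodes if n.ha_state == "master"]
    # Sorting by priority only mirrors the table's numbers; it is not a
    # claim about how SM-HA picks the next master.
    standbys = sorted(
        (n for n in nodes if n.ha_state == "standby"),
        key=lambda n: n.priority,
        reverse=True,
    )
    print("master:", [n.name for n in masters])
    print("standbys by priority:", [(n.name, n.priority) for n in standbys])
```

The regex requires an IPv4 address in the fourth position, which is what keeps the header, the prompt line and the “HA IP address” line from being picked up as rows.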