SM-HA master not running SM

I’m hoping that someone has some useful advice to help me fix an issue in my environment.

We have a Mellanox HA fabric made up of 4 SB7700 switches (ib1 - ib4). Something seems to have happened to stop failover somewhere along the line. Our ib3 switch is the SM-HA master with the other switches running in standby. However, our master switch is not actually running the SM. It’s enabled, but in the “stopped” state. I don’t see a way to force it to restart or to force another switch to become master.

Also, I’m a little nervous about rebooting ib3 in the hope that another switch becomes master. We’re using this IB fabric as the backbone for a GPFS system. One of the quorum servers already dropped out of the network because its port appears to have died on ib3, but we can’t check the ports with the SM not running (and new connections to the switch aren’t recognized or initiated unless the SM on the switch is active). Also, without the SM running on the master switch, it appears that the servers still in the GPFS cluster no longer see the other switches. I say that because running “ibswitches” on any of the nodes still in the cluster reports only the ib3 switch, whereas the rebooted quorum node that lost its connection to ib3 sees all the other switches. That server can’t connect to any other node over the IB network, though, since the others are only paying attention to ib3 right now.

Is there any way to force a change in the SM master (without rebooting the master switch) or to try to restart the SM on ib3? SM is enabled, just not running. With the risk of losing quorum in my GPFS cluster if ib3 goes down without the servers seeing another switch, I’m worried that there’s no way forward without planning a major outage, stopping GPFS everywhere, rebooting ib3, and just hoping I’ll be able to restart GPFS afterwards. That’s not a prospect I’d like to have to plan for.

So, any thoughts or advice for how to get my fabric out of this situation would be GREATLY appreciated.

Hi,
Please log in to the VIP and send me the output of this command:
show ib smnodes brief
Thanks,
Suo

Suo,

Thank you for the response. The output you requested is given below:

cab-ib3 [cab-ib-cluster: master] # show ib smnodes brief

HA state of switch infiniband-default

IB Subnet HA name: cab-ib-cluster
HA IP address: 172.20.0.107/24
Active HA nodes: 4

ID Local node SM-HA state IP SM Priority

cab-ib3 * master 172.20.0.105 enabled 10
cab-ib1 standby 172.20.0.103 enabled 7
cab-ib2 standby 172.20.0.104 enabled 1
cab-ib4 standby 172.20.0.106 enabled 4

Please note that in the “show ib smnodes” long output, the entry for cab-ib3 shows “SM Running : stopped” while the others all show “SM Running : running” as expected.

Let me try that again in pre-formatted text, since the system interpreted the underlines as bold formatting for the lines above:

cab-ib3 [ab-ib-cluster: master] # show ib smnodes brief

HA state of switch infiniband-default
========================================
IB Subnet HA name: ab-ib-cluster
HA IP address:     172.20.0.107/24
Active HA nodes:   4

 ID           Local node        SM-HA state   IP              SM                 Priority
------------------------------------------------------------------------------------------
 cab-ib3      *                 master        172.20.0.105     enabled            10
 cab-ib1                        standby       172.20.0.103     enabled            7
 cab-ib2                        standby       172.20.0.104     enabled            1
 cab-ib4                        standby       172.20.0.106     enabled            4
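
For anyone who wants to check this table from a script, here is a minimal Python sketch that parses the rows pasted above and picks out the SM-HA master. The function name `parse_smnodes_brief` and the row layout it assumes (5 whitespace-separated fields, plus an optional `*` local-node marker) are my own assumptions based only on this one output, not anything documented for the switch OS.

```python
# Rows copied from the "show ib smnodes brief" output above.
SAMPLE = """\
 ID           Local node        SM-HA state   IP              SM                 Priority
------------------------------------------------------------------------------------------
 cab-ib3      *                 master        172.20.0.105     enabled            10
 cab-ib1                        standby       172.20.0.103     enabled            7
 cab-ib2                        standby       172.20.0.104     enabled            1
 cab-ib4                        standby       172.20.0.106     enabled            4
"""


def parse_smnodes_brief(text):
    """Return one dict per node row (hypothetical helper, layout assumed)."""
    nodes = []
    for line in text.splitlines():
        tokens = line.split()
        # Data rows end with a numeric priority; headers and rules do not.
        if not tokens or not tokens[-1].isdigit():
            continue
        local = "*" in tokens
        if local:
            tokens.remove("*")
        if len(tokens) != 5:
            continue  # not a row shape this sketch understands
        name, ha_state, ip, sm, priority = tokens
        nodes.append({
            "name": name,
            "local": local,
            "ha_state": ha_state,
            "ip": ip,
            "sm": sm,
            "priority": int(priority),
        })
    return nodes


nodes = parse_smnodes_brief(SAMPLE)
masters = [n["name"] for n in nodes if n["ha_state"] == "master"]
print("SM-HA master:", masters)
```

Note that this table only shows whether the SM is enabled on each node, not whether it is actually running, so it cannot reveal the problem on its own.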

Please provide:

QM9700-2 [standalone: master] (config) # show ib smnode QM9700-2 sm-state
QM9700-2 [standalone: master] (config) # show ib smnode QM9700-2 sm-running

Here is the output of the commands for each of my 4 switches. (As a reminder, cab-ib3 is my master.)

cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib1 sm-state
enabled
cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib1 sm-running
active

cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib2 sm-state
enabled
cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib2 sm-running
active

cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib3 sm-state
enabled
cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib3 sm-running
not active

cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib4 sm-state
enabled
cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib4 sm-running
active
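
To make the mismatch explicit, here is a small Python sketch that reads a transcript like the one above and lists any node whose SM is enabled but not reported as active. The helper names `collect_status` and `stalled_nodes` are hypothetical, and the regular expression assumes each command is followed directly by a one-line value, as in this output.

```python
import re

# Transcript copied from the command output above.
TRANSCRIPT = """\
cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib1 sm-state
enabled
cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib1 sm-running
active

cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib2 sm-state
enabled
cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib2 sm-running
active

cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib3 sm-state
enabled
cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib3 sm-running
not active

cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib4 sm-state
enabled
cab-ib3 [ab-ib-cluster: master] (config) # show ib smnode cab-ib4 sm-running
active
"""

# Matches the command, then captures the single line of output after it.
PATTERN = re.compile(r"# show ib smnode (\S+) (sm-state|sm-running)\s*\n(.*)")


def collect_status(transcript):
    """Map node name -> {'sm-state': ..., 'sm-running': ...}."""
    status = {}
    for node, field, value in PATTERN.findall(transcript):
        status.setdefault(node, {})[field] = value.strip()
    return status


def stalled_nodes(status):
    """Nodes where the SM is enabled but not reported as active."""
    return [
        node
        for node, fields in status.items()
        if fields.get("sm-state") == "enabled"
        and fields.get("sm-running") != "active"
    ]


status = collect_status(TRANSCRIPT)
print("Enabled but not running:", stalled_nodes(status))
```

With the output above, the only node flagged is cab-ib3, which is also the SM-HA master.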


Please try the following commands to reset OpenSM on ib3:
#ib smnode cab-ib3 disable
#ib smnode cab-ib3 enable

Then check the status
