Dear Support,
We're running a 35-node cluster that worked perfectly for several years, but recently one of our admins ran the HP upgrade tool, which updated the firmware of the servers' InfiniBand cards.
Since then, most of the links are no longer being established and remain in Polling. However, one of the nodes seems able to connect even with the new firmware.
We're struggling to diagnose the issue (is it really the firmware upgrade, can we roll back, should we upgrade the switches, etc.) and to work out how to address it without changing the whole setup (drivers, OS, firmware, etc.).
Rebooting the switches has no effect. Swapping cables lets the server connect properly via the other port, so the problem seems tied to the nodes themselves.
Any hints would be greatly appreciated.
-jd
Working node with old firmware:
[root@***s29 ~]# ibstat
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 2
    Firmware version: 2.10.2350
    Hardware version: 0
    Node GUID: **
    System image GUID: ***
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 35
        LMC: 0
        SM lid: 2
        Capability mask: **
        Port GUID: ***
        Link layer: InfiniBand
We're using the following cards (from HP):
[root@****~]# lspci | grep Mell
07:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
Non-working node with new firmware:
[root@***s03 ~]# ibstat
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 2
    Firmware version: 2.36.5000
    Hardware version: 0
    Node GUID: **
    System image GUID: **
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: **
        Port GUID: **
        Link layer: InfiniBand
Switch firmware:
***# module-firmware show
Module No. Type Node GUID LID FW Version SW Version
4036/2036 3.6.2-872
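
To check whether the Polling state really tracks the firmware version across all 35 nodes, the ibstat output from each node could be summarized side by side. Below is a minimal sketch in Python, assuming the "Key: value" layout of the ibstat listings above. The helper names (parse_ibstat, summarize, count_by_firmware) are hypothetical and written only for illustration; other ibstat versions may format their output differently.

```python
import re
from collections import Counter

# Accept straight or typographic quotes around the CA name.
_QUOTE = "['\"\u2018\u2019]"
_CA_RE = re.compile(r"CA\s+" + _QUOTE + r"(.+?)" + _QUOTE + r"$")
_PORT_RE = re.compile(r"Port\s+(\d+):$")


def parse_ibstat(text):
    """Parse ibstat output into a list of per-CA dicts (assumed layout)."""
    cas = []
    ca = None
    port = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = _CA_RE.match(line)
        if m:
            # A new adapter section starts; reset the current port.
            ca = {"name": m.group(1), "ports": {}}
            cas.append(ca)
            port = None
            continue
        m = _PORT_RE.match(line)
        if m and ca is not None:
            # A new port section starts under the current adapter.
            port = {}
            ca["ports"][int(m.group(1))] = port
            continue
        if ":" in line and ca is not None:
            # Plain "Key: value" line; attach it to the port if we are
            # inside one, otherwise to the adapter itself.
            key, _, value = line.partition(":")
            target = port if port is not None else ca
            target[key.strip()] = value.strip()
    return cas


def summarize(node, text):
    """Return (node, CA, firmware, port, state, physical state) rows."""
    rows = []
    for ca in parse_ibstat(text):
        for num, port in sorted(ca["ports"].items()):
            rows.append((node, ca["name"], ca.get("Firmware version"),
                         num, port.get("State"), port.get("Physical state")))
    return rows


def count_by_firmware(rows):
    """Count ports per (firmware version, link state) pair."""
    return Counter((fw, state) for _, _, fw, _, state, _ in rows)
```

Feeding summarize() the saved ibstat output of every node and passing the combined rows to count_by_firmware() would show at a glance whether all Down ports sit on the newer firmware, and which nodes are the exceptions.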