Links no longer come up between 4036 switch and hosts

Dear Support,

We’re running a 35-node cluster that had been working perfectly for several years. Recently, one of our admins ran the HP upgrade tool, which updated the firmware of the InfiniBand cards in the servers.

Since then, most of the links fail to come up and remain in the Polling state. However, one of the nodes seems able to connect even with the new firmware.

We’re struggling to diagnose the issue (is it really the firmware upgrade, can we roll back, should we upgrade the switches, etc.) and to work out how to address it without changing the whole setup (drivers, OS, firmware, ...).

Rebooting the switches has no effect. Swapping cables lets the server connect properly via the other port, so the problem seems to be tied to the nodes themselves.

Any hints would be greatly appreciated.

-jd

Working node with old firmware:

[root@***s29 ~]# ibstat
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 2
    Firmware version: 2.10.2350
    Hardware version: 0
    Node GUID: **
    System image GUID: ***
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 35
        LMC: 0
        SM lid: 2
        Capability mask: **
        Port GUID: ***
        Link layer: InfiniBand

We’re using the following cards (from HP):

[root@****~]# lspci | grep Mell
07:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

Non-working node with new firmware:

[root@***s03 ~]# ibstat
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 2
    Firmware version: 2.36.5000
    Hardware version: 0
    Node GUID: **
    System image GUID: **
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: **
        Port GUID: **
        Link layer: InfiniBand

***# module-firmware show
Module No. Type Node GUID LID FW Version SW Version
4036/2036 3.6.2-872
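
To see at a glance which nodes are stuck and which firmware they run, the ibstat output collected from each host could be parsed and compared. Below is a minimal sketch in Python, assuming the output follows the layout shown above. The function names parse_ibstat and down_ports are made up for illustration and are not part of any Mellanox or HP tooling.

```python
import re

# Per-port fields worth keeping when comparing nodes (taken from the ibstat output above).
PORT_FIELDS = ("State", "Physical state", "Rate", "Base lid", "SM lid")


def parse_ibstat(text):
    """Parse ibstat text into a list of dicts, one per CA (hypothetical helper)."""
    cas = []
    ca = None
    port = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # "CA 'mlx4_0'" starts a new adapter; accept straight or curly quotes,
        # since pasted output often has its quotes converted.
        m = re.match(r"CA\s+['\u2018\u2019](.+?)['\u2018\u2019]$", line)
        if m:
            ca = {"name": m.group(1), "firmware": None, "ports": {}}
            cas.append(ca)
            port = None
            continue
        # "Port 1:" starts a new port section ("Port GUID: ..." does not match).
        m = re.match(r"Port\s+(\d+):$", line)
        if m and ca is not None:
            port = {}
            ca["ports"][int(m.group(1))] = port
            continue
        if ca is None or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Firmware version":
            ca["firmware"] = value
        elif port is not None and key in PORT_FIELDS:
            port[key] = value
    return cas


def down_ports(cas):
    """Return (ca, firmware, port, state, physical state) for ports that are not Active/LinkUp."""
    return [
        (ca["name"], ca["firmware"], num, p.get("State"), p.get("Physical state"))
        for ca in cas
        for num, p in sorted(ca["ports"].items())
        if p.get("State") != "Active" or p.get("Physical state") != "LinkUp"
    ]
```

Feeding it the ibstat text gathered from every node (for example over ssh) and grouping the result by firmware version would show whether the Polling ports line up with 2.36.5000, and which node is the exception.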

I would try contacting HP support or Mellanox support to get a firmware version that works for you.