Issue with opensm and NDR Speed

Hi,
We have a cross-generational InfiniBand (IB) network that ranges from QDR to NDR. Since we only have unmanaged switches, we run opensm 5.20.0.MLNX20240804.ef1f438a with the default settings on one of the nodes.
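For context, a quick way to confirm which node currently holds the master SM (using sminfo from infiniband-diags) is:

sminfo
# prints the lid, guid, priority and state of the master SM;
# it should report the same SM lid (312) that ibstat shows further down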

Recently, while adding new NDR nodes to the network, I noticed that the link listing shows the following for some of these new nodes:

         324   37[  ] ==( 4X        106.25 Gbps Active/  LinkUp)==>     337    1[  ] "node317 ibp66s0" ( )
         324   38[  ] ==( 1X           2.5 Gbps Active/  LinkUp)==>     336    1[  ] "node318 ibp66s0" (Could be 4X )
         324   39[  ] ==(                Down/ Polling)==>             [  ] "" ( )
         324   40[  ] ==(                Down/ Polling)==>             [  ] "" ( )
         324   41[  ] ==( 1X           2.5 Gbps Active/  LinkUp)==>     332    1[  ] "node312 ibp66s0" (Could be 4X )
         324   42[  ] ==( 1X           2.5 Gbps Active/  LinkUp)==>     333    1[  ] "node311 ibp66s0" (Could be 4X )
         324   43[  ] ==( 1X           2.5 Gbps Active/  LinkUp)==>     335    1[  ] "node315 ibp66s0" (Could be 4X )
         324   44[  ] ==( 4X        106.25 Gbps Active/  LinkUp)==>     334    1[  ] "node316 ibp66s0" ( )
         324   45[  ] ==( 1X           2.5 Gbps Active/  LinkUp)==>     329    1[  ] "node309 ibp66s0" (Could be 4X )
         324   46[  ] ==( 1X           2.5 Gbps Active/  LinkUp)==>     328    1[  ] "node310 ibp66s0" (Could be 4X )
         324   47[  ] ==( 1X           2.5 Gbps Active/  LinkUp)==>     331    1[  ] "node314 ibp66s0" (Could be 4X )
         324   48[  ] ==( 4X        106.25 Gbps Active/  LinkUp)==>     330    1[  ] "node313 ibp66s0" ( )
         324   49[  ] ==( 1X           2.5 Gbps Active/  LinkUp)==>     315    1[  ] "node303 ibp66s0" (Could be 4X )
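The bad links above have negotiated 1X at 2.5 Gbps (a single SDR lane) instead of 4X at 106.25 Gbps per lane, which is what gives the NDR rate of 400 that ibstat reports on healthy nodes (4 lanes x 106.25 Gbps signalling, roughly 400 Gbps of data). Assuming the listing above came from iblinkinfo, a quick way to pull out every degraded link in the fabric is:

iblinkinfo | grep "Could be"
# only prints links that came up below their supported width or speed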

For these nodes in particular, ibstat shows the following:

ibstat
CA 'ibp66s0'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.40.1000
        Hardware version: 0
        Node GUID: 0xa088c203006e323e
        System image GUID: 0xa088c203006e323e
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 2.5
                Base lid: 343
                LMC: 0
                SM lid: 312
                Capability mask: 0xa751e848
                Port GUID: 0xa088c203006e323e
                Link layer: InfiniBand

For the functional nodes, the rate is 400. This is an issue because if I put these slower nodes into production, an MPI job that runs across one bad node and one good node somehow triggers constant heavy sweeps from opensm, which leads to the network becoming unresponsive.
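Before putting new nodes into production we could catch these with a quick check along these lines (the hostnames below are just placeholders taken from the listing above, and the HCA name may differ per node):

for h in node309 node310 node311 node312; do
    echo -n "$h: "
    ssh "$h" "ibstat ibp66s0 1 | grep 'Rate:'"
done
# anything reporting "Rate: 2.5" instead of "Rate: 400" has come up degraded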

On the InfiniBand cards in the affected nodes, one LED is orange and one is green, which according to the documentation means the card is not operating at maximum speed. To check whether it is a cable issue or a switch issue, I tried the following:
I took one of these nodes and replaced its cable with one that is known to work; the error remained.
Then I unplugged the node from my network, installed opensm on it, and plugged it into a spare NDR unmanaged switch that is NOT connected to anything else. This fixed the issue: the node came up at full speed. I then stopped opensm on the node, uninstalled it, unplugged the node from the isolated switch and plugged it back into our network, and this time the node stayed at a rate of 400. Now my question is: is there a way to make opensm rescan these slower nodes to bring them up to full speed, or not?
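To be concrete about what I mean by a rescan: my understanding (not yet verified on this hardware) is that the link can be forced to retrain by bouncing the port, and that opensm can be nudged into a resweep, roughly like this:

ibportstate 343 1 reset
# 'reset' should take the port down and back up so the link renegotiates width/speed

kill -HUP "$(pidof opensm)"
# per the docs a SIGHUP should force opensm into a heavy sweep, though I have not
# confirmed this on 5.20.0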

I also tried:

ibportstate 343 1 width 4x

Initial CA/RT PortInfo:
# Port info: Lid 343 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................343
SMLid:...........................312
LMC:.............................0
LinkWidthSupported:..............1X or 4X or 2X
LinkWidthEnabled:................1X or 4X or 2X
LinkWidthActive:.................1X
LinkSpeedSupported:..............2.5 Gbps
LinkSpeedEnabled:................2.5 Gbps
LinkSpeedActive:.................2.5 Gbps
LinkSpeedExtSupported:...........0
LinkSpeedExtEnabled:.............0
LinkSpeedExtActive:..............No Extended Speed
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
ibportstate: iberror: failed: smp set portinfo failed
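A read-only way to double-check what the port actually has enabled after an attempt like this (smpquery is in infiniband-diags and can be run from anywhere on the fabric) is:

smpquery portinfo 343 1 | grep -E "LinkWidth|LinkSpeed"
# shows the LinkWidth*/LinkSpeed* fields of the PortInfo attribute for lid 343 port 1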

To add to this: when the network became unresponsive, opensm.log was filled with:


Nov 08 00:04:50 650541 [A67606C0] 0x02 -> sa_mad_ctrl_process: Dropping MAD since the dispatcher is already overloaded with 176 messages and queue time of:109120[msec]
Nov 08 00:04:50 650573 [A67606C0] 0x02 -> sa_mad_ctrl_process: Dropping MAD since the dispatcher is already overloaded with 176 messages and queue time of:109120[msec]
Nov 08 00:04:50 656908 [A67606C0] 0x02 -> sa_mad_ctrl_process: Dropping MAD since the dispatcher is already overloaded with 176 messages and queue time of:109120[msec]
Nov 08 00:04:50 826683 [A67606C0] 0x02 -> sa_mad_ctrl_process: Dropping MAD since the dispatcher is already overloaded with 176 messages and queue time of:109120[msec]
Nov 08 00:04:50 969878 [A67606C0] 0x02 -> sa_mad_ctrl_process: Dropping MAD since the dispatcher is already overloaded with 176 messages and queue time of:109120[msec]
Nov 08 00:04:50 970494 [A67606C0] 0x02 -> sa_mad_ctrl_process: Dropping MAD since the dispatcher is already overloaded with 176 messages and queue time of:109120[msec]
Nov 08 00:04:50 976935 [A67606C0] 0x02 -> sa_mad_ctrl_process: Dropping MAD since the dispatcher is already overloaded with 176 messages and queue time of:109120[msec]
Nov 08 00:04:51 146831 [A67606C0] 0x02 -> sa_mad_ctrl_process: Dropping MAD since the dispatcher is already overloaded with 176 messages and queue time of:109120[msec]
Nov 08 00:04:51 289871 [A67606C0] 0x02 -> sa_mad_ctrl_process: Dropping MAD since the dispatcher is already overloaded with 176 messages and queue time of:109120[msec]
Nov 08 00:04:51 290497 [A67606C0] 0x02 -> sa_mad_ctrl_process: Dropping MAD since the dispatcher is already overloaded with 176 messages and queue time of:109120[msec]
Nov 08 00:04:51 297025 [A67606C0] 0x02 -> sa_mad_ctrl_process: Dropping MAD since the dispatcher is already overloaded with 176 messages and queue time of:109120[msec]
Nov 08 00:04:51 466624 [A67606C0] 0x02 -> sa_mad_ctrl_process: Dropping MAD since the dispatcher is already overloaded with 176 messages and queue time of:109120[msec]
Nov 08 00:04:51 609976 [A67606C0] 0x02 -> sa_mad_ctrl_process: Dropping MAD since the dispatcher is already overloaded with 176 messages and queue time of:109120[msec]
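In case it helps anyone hitting the same storm, a quick way to see how fast these messages pile up (assuming the default log location) is:

grep -c "Dropping MAD since the dispatcher is already overloaded" /var/log/opensm.log
# a rapidly growing count while jobs are running means the SA is being flooded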

This was finally solved by updating the card firmware. Note that in this case the cards and the nodes in question were manufactured by HPE. We still do not know why some nodes worked out of the box and others did not.
