Hi,
We have a cross-generational InfiniBand (IB) network ranging from QDR to NDR. We are currently running OpenSM 5.20.0.MLNX20240804.ef1f438a with default settings on one node, as we only have unmanaged switches.
Recently, while adding new NDR nodes to the network, I noticed that some of these new nodes come up with degraded links:
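For context, the SM is just the stock daemon started with its defaults, along these lines (the port GUID below is a placeholder):

# OpenSM on one node, bound to the fabric-facing HCA port;
# -B daemonizes, -g selects the port GUID (placeholder value).
opensm -B -g 0x<port_guid>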
324 37[ ] ==( 4X 106.25 Gbps Active/ LinkUp)==> 337 1[ ] "node317 ibp66s0" ( )
324 38[ ] ==( 1X 2.5 Gbps Active/ LinkUp)==> 336 1[ ] "node318 ibp66s0" (Could be 4X )
324 39[ ] ==( Down/ Polling)==> [ ] "" ( )
324 40[ ] ==( Down/ Polling)==> [ ] "" ( )
324 41[ ] ==( 1X 2.5 Gbps Active/ LinkUp)==> 332 1[ ] "node312 ibp66s0" (Could be 4X )
324 42[ ] ==( 1X 2.5 Gbps Active/ LinkUp)==> 333 1[ ] "node311 ibp66s0" (Could be 4X )
324 43[ ] ==( 1X 2.5 Gbps Active/ LinkUp)==> 335 1[ ] "node315 ibp66s0" (Could be 4X )
324 44[ ] ==( 4X 106.25 Gbps Active/ LinkUp)==> 334 1[ ] "node316 ibp66s0" ( )
324 45[ ] ==( 1X 2.5 Gbps Active/ LinkUp)==> 329 1[ ] "node309 ibp66s0" (Could be 4X )
324 46[ ] ==( 1X 2.5 Gbps Active/ LinkUp)==> 328 1[ ] "node310 ibp66s0" (Could be 4X )
324 47[ ] ==( 1X 2.5 Gbps Active/ LinkUp)==> 331 1[ ] "node314 ibp66s0" (Could be 4X )
324 48[ ] ==( 4X 106.25 Gbps Active/ LinkUp)==> 330 1[ ] "node313 ibp66s0" ( )
324 49[ ] ==( 1X 2.5 Gbps Active/ LinkUp)==> 315 1[ ] "node303 ibp66s0" (Could be 4X )
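(For reference, the listing above is a fabric-wide link scan; with infiniband-diags installed, something like this reproduces it and pulls out just the degraded links:)

# Dump every link in the fabric, then keep only the ones where the
# active width is below what the two ports could negotiate.
iblinkinfo | grep 'Could be'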
For these nodes in particular, ibstat shows:
ibstat
CA 'ibp66s0'
CA type: MT4129
Number of ports: 1
Firmware version: 28.40.1000
Hardware version: 0
Node GUID: 0xa088c203006e323e
System image GUID: 0xa088c203006e323e
Port 1:
State: Active
Physical state: LinkUp
Rate: 2.5
Base lid: 343
LMC: 0
SM lid: 312
Capability mask: 0xa751e848
Port GUID: 0xa088c203006e323e
Link layer: InfiniBand
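Since these are MT4129 cards (ConnectX-7, if I read the part number correctly), mlxlink from the Mellanox Firmware Tools gives the physical-layer view of the same port; a sketch, with the device path as a placeholder:

# Start the MST driver, find the device name, then query the port.
mst start
mst status
mlxlink -d /dev/mst/mt4129_pciconf0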
On the functional nodes, the rate is 400. This is a problem: if I put these slower nodes into production, an MPI job running across one bad node and one good node somehow triggers constant heavy sweeps from OpenSM, which makes the whole network unresponsive.
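(To see the sweep storm directly, I count completed sweeps in the SM log while such a job runs; as far as I know, "SUBNET UP" is logged after each successful heavy sweep, assuming the default log path:)

# A rapidly growing count here while the mixed job runs confirms
# that OpenSM is resweeping constantly.
watch -n 5 "grep -c 'SUBNET UP' /var/log/opensm.log"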
On the InfiniBand cards in the affected nodes themselves, one LED is orange and one is green, which according to the documentation means the card is not operating at maximum speed. To rule out a cable or switch problem, I tried the following:
I took one of these nodes and replaced its cable with a known-good one; the error remained.
Then I took the node, unplugged it from my network, installed OpenSM on the node itself, and plugged it into a spare unmanaged NDR switch that is NOT connected to anything else. This fixed the issue: the node came up at full speed. I then stopped and uninstalled OpenSM on the node, unplugged it from the isolated switch, and plugged it back into our network; this time the node stayed at a rate of 400. My question: is there a way to make OpenSM rescan these slower nodes and bring them back up to full speed?
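What I have pieced together so far, none of it verified on this fabric: bouncing the degraded link so it retrains and the SM picks it up, or asking OpenSM to resweep outright. The switch port number below is a placeholder for whichever switch port faces the slow node.

# 1) Bounce the switch-side port so the link retrains; the resulting
#    trap should also make the SM resweep (324 is the switch LID from
#    the listing above, <port> is a placeholder).
ibportstate 324 <port> reset

# 2) Ask OpenSM itself: SIGHUP makes it re-read its configuration and,
#    as far as I understand, also kicks off a heavy sweep.
kill -HUP "$(pidof opensm)"

# 3) If the OpenSM console is enabled (console local/socket in
#    opensm.conf), it reportedly accepts a resweep command:
#    resweep heavy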
I also tried:
ibportstate 343 1 width 4x
Initial CA/RT PortInfo:
# Port info: Lid 343 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................343
SMLid:...........................312
LMC:.............................0
LinkWidthSupported:..............1X or 4X or 2X
LinkWidthEnabled:................1X or 4X or 2X
LinkWidthActive:.................1X
LinkSpeedSupported:..............2.5 Gbps
LinkSpeedEnabled:................2.5 Gbps
LinkSpeedActive:.................2.5 Gbps
LinkSpeedExtSupported:...........0
LinkSpeedExtEnabled:.............0
LinkSpeedExtActive:..............No Extended Speed
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
ibportstate: iberror: failed: smp set portinfo failed
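Two things I am unsure about with that set, in case they matter: the width op apparently takes the numeric LinkWidthEnabled bitmask from the IB spec rather than a string, and my understanding is that both sides of the link must agree and the link must retrain before LinkWidthActive changes. A sketch of what I would try next (the switch port is again a placeholder):

# Request 4X explicitly with the numeric mask (1 = 1X, 2 = 4X,
# 3 = 1X or 4X) instead of the string "4x":
ibportstate 343 1 width 2

# Match it on the switch side and retrain the link (<port> is a
# placeholder for the switch port facing LID 343):
ibportstate 324 <port> width 2
ibportstate 324 <port> reset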