InfiniBand NIC: State is Down but Physical state is LinkUp

I have two servers, each with one InfiniBand NIC, connected to an InfiniBand switch (MQM8700-HS2F) that is running the subnet manager.
On the host, the port's physical state is LinkUp, but its logical state is Down.
When I run ib_send_bw, I get:
user:~$ ib_send_bw
WARNING: BW peak won't be measured in this run.
Port number 1 state is Down
Couldn't set the link layer
Couldn't get context for the device

Any help would be appreciated.

This is the output of ibstat on the host:
CA 'mlx5_1'
    CA type: MT41692
    Number of ports: 1
    Firmware version: 32.42.1000
    Hardware version: 1
    Node GUID: 0x5c25730300e77133
    System image GUID: 0x5c25730300e77132
    Port 1:
        State: Down
        Physical state: LinkUp
        Rate: 200
        Base lid: 65535
        LMC: 0
        SM lid: 1
        Capability mask: 0xa751ec48
        Port GUID: 0x5c25730300e77133
        Link layer: InfiniBand
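
To check several hosts for the same symptom, the State/Physical state mismatch can be picked out of saved ibstat text. The following is only a minimal Python sketch: it assumes the `Key: value` layout shown in the output above, with `State` listed before `Physical state`, and the function name `find_stuck_ports` is hypothetical, not part of any InfiniBand tool.

```python
import re


def find_stuck_ports(ibstat_text):
    """Return (ca_name, port_number) pairs whose logical State is Down
    while the Physical state is LinkUp.

    Assumes 'State' appears before 'Physical state' within each port
    section, as in the ibstat output quoted in this thread.
    """
    stuck = []
    ca = port = state = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        # Start of a new adapter section, e.g. CA 'mlx5_1'
        # (curly quotes are accepted in case the text was copied from a web page).
        m = re.match(r"CA\s+['‘](.+?)['’]$", line)
        if m:
            ca, port, state = m.group(1), None, None
            continue
        # Start of a new port section, e.g. "Port 1:"
        m = re.match(r"Port\s+(\d+):$", line)
        if m:
            port, state = int(m.group(1)), None
            continue
        if port is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "State":
            state = value
        elif key == "Physical state" and state == "Down" and value == "LinkUp":
            stuck.append((ca, port))
    return stuck


sample = """CA 'mlx5_1'
    Number of ports: 1
    Port 1:
        State: Down
        Physical state: LinkUp
"""
print(find_stuck_ports(sample))  # [('mlx5_1', 1)]
```

A port flagged this way has a trained physical link but has not been brought to Active, which is the situation described in this post.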

And I ran ibdiagnet on the switch with the following output:
Running version: "IBDIAGNET 2.10.0.MLNX20220720.cd746c3","IBDIAG 2.1.1.cd746c3","IBDM 2.1.1.cd746c3","IBIS 7.0.0.c25850e"
Running command: /usr/bin/ibdiagnet
Running timestamp: 2024-11-07 14:05:38 UTC +0000

Switch label port numbering explanation:
Quantum2 switch split mode: ASIC/Cage/Port/Split, e.g 1/1/1/1
Quantum2 switch no split mode: ASIC/Cage/Port
Quantum switch split mode: Port/Split
Quantum switch no split mode: Port


Load Plugins from:
/usr/share/ibdiagnet2.1.1/plugins/
(You can specify more paths to be looked in with "IBDIAGNET_PLUGINS_PATH" env variable)

Plugin Name Result Comment
libibdiagnet_cable_diag_plugin-2.1.1 Succeeded Plugin loaded
libibdiagnet_phy_diag_plugin-2.1.1 Succeeded Plugin loaded


Discovery
-I- Start Fabric Discover
-I- Fill NodeDesc data
-I- NodeDesc finished successfully
-I- Fabric Discover finished successfully

-I- Fill PortInfo data
-I- PortInfo finished successfully

-I- No scope files. Total switches/ports [1/41], CAs/ports [2/2]
-I- Build VS Capability GMP
-I- VS Capability GMP finished successfully

-I- Build VS Capability SMP
-I- Build VS Capability FW Info SMP
-I- Build VS Capability Mask SMP
-I- VS Capability SMP finished successfully

-I- Build VS Extended Port Info
-I- VS ExtendedPortInfo finished successfully

-I- Build VS Port Info Extended
-I- Port Info Extended finished successfully

-I- Build Switch Info
-I- Switch Info retrieving finished successfully

-I- Build Hierarchy Info
-I- Hierarchy Info retrieving finished successfully

-I- Build AR Info
-I- AR Info retrieving finished successfully

-I- Duplicated GUIDs detection finished successfully

-W- Note: If you have unmanaged systems then duplication can occur
-W- Duplicated Node Description detection finished with warnings
-W- S5c25730300e77132/U2 - Node with GUID=0x5c25730300e77143 is configured with duplicated node description - localhost HCA-2
-W- S5c25730300e7fd52/U2 - Node with GUID=0x5c25730300e7fd63 is configured with duplicated node description - localhost HCA-2

-I- Port Hierarchy Info finished successfully


Lids Check
-I- Lids Check finished successfully


Links Check
-I- Links Check finished successfully


Subnet Manager
-I- SM Info retrieving finished successfully

-I- Subnet Manager Check finished successfully


Port Counters
-I- Build PMClassPortInfo

-I- Build PMPortSampleControl

-I- Build Port Counters

-I- Ports counters retrieving finished successfully

-I- RN counters retrieving finished successfully

-I- HBF counters retrieving finished successfully

-I- Going to sleep for 1 seconds until next counters sample

-I- Build Port Counters
-I- Ports counters retrieving (second time) finished successfully

-I- Ports counters value Check finished successfully

-I- Ports counters overflow value Check finished successfully

-I- pFRN Received Error check finished successfully

-I- Ports counters Difference Check (during run) finished successfully

-I- Ports counters delta check finished successfully


Nodes Information
-I- Devid: 41692(0xa2dc), PSID: MT_0000000884, Latest FW Version:32.42.1000
-I- Devid: 54000(0xd2f0), PSID: MT_0000000062, Latest FW Version:27.2010.5042
-I- FW Check finished successfully


Speed / Width checks
-I- Link Speed Check (Compare to supported link speed)
-I- Links Speed Check finished successfully

-I- Link Width Check (Compare to supported link width)
-I- Links Width Check finished successfully


Virtualization
-I- Build Virtualization Info DB

-I- Build VPort Info DB

-I- Build VPort Info DB

-I- Build VPort GUID Info DB

-I- Build VNode Info DB

-I- Build VPort PKey Table DB

-I- Build Node Description DB

-I- Virtualization finished successfully

-I- Virtual ports retrieving finished successfully

-I- Virtual ports retrieving finished successfully


Partition Keys
-I- Partition Keys retrieving finished successfully

-I- Partition Keys finished successfully


Temperature Sensing
-I- Temperature Sensing finished successfully


Routers
-I- Build Routers Info DB finished successfully

-I- Build Routers Tables finished successfully


Post Reports Generation
-I- Writing of IBNetdDscover file finished successfully

Fabric Summary

Total Nodes : 3
IB Switches : 1
IB Channel Adapters : 2
IB Aggregation Nodes : 0
IB Routers : 0

Adaptive Routing is enabled on 0 switches.
Hashed Based Forwarding is enabled on 0 switches.

Total number of links : 2
Links at 4x50 : 2

Master SM: Port=0 LID=1 GUID=0xa088c2030078685c devid=54000 Priority:0 Node_Type=SW Node_Description=MF0;snake0:MQM8700/U1
Standby SM : No Standby SM


Summary
-I- Stage                     Warnings   Errors     Comment
-I- Discovery                 2          0
-I- Lids Check                0          0
-I- Links Check               0          0
-I- Subnet Manager            0          0
-I- Port Counters             0          0
-I- Nodes Information         0          0
-I- Speed / Width checks      0          0
-I- Virtualization            0          0
-I- Partition Keys            0          0
-I- Temperature Sensing       0          0
-I- Routers                   0          0
-I- Post Reports Generation   0          0

-I- You can find detailed errors/warnings in: /var/tmp/ibdiagnet2/ibdiagnet2.log

-I- Database : /var/tmp/ibdiagnet2/ibdiagnet2.db_csv
-I- LST : /var/tmp/ibdiagnet2/ibdiagnet2.lst
-I- Network dump : /var/tmp/ibdiagnet2/ibdiagnet2.net_dump
-I- Subnet Manager : /var/tmp/ibdiagnet2/ibdiagnet2.sm
-I- Ports Counters : /var/tmp/ibdiagnet2/ibdiagnet2.pm
-I- RN counters 2 : /var/tmp/ibdiagnet2/ibdiagnet2.rnc2
-I- Nodes Information : /var/tmp/ibdiagnet2/ibdiagnet2.nodes_info
-I- VPorts : /var/tmp/ibdiagnet2/ibdiagnet2.vports
-I- VPorts Pkey : /var/tmp/ibdiagnet2/ibdiagnet2.vports_pkey
-I- Partition keys : /var/tmp/ibdiagnet2/ibdiagnet2.pkey
-I- IBNetDiscover : /var/tmp/ibdiagnet2/ibdiagnet2.ibnetdiscover

The switch interface I plugged into shows the following:
IB1/1 state:
Logical port state : Active
Physical port state : LinkUp
Current line rate : 200.0 Gbps
Supported speeds : sdr, qdr, fdr, edr, hdr
Speed : hdr
Supported widths : 1X, 2X, 4X
Width : 4X
Max supported MTUs : 4096
MTU : 4096
VL admin capabilities : VL0 - VL7
Operational VLs : VL0 - VL3
Description :
IB Subnet : infiniband-default
Phy-profile : high-speed-ber
Width reduction mode : Not supported
Telemetry sampling : Disabled
Telemetry threshold : Disabled
Telemetry record : Disabled
Telemetry threshold level: N/A bytes

RX:
Bytes : 5472
Packets : 19
Errors : 0
Symbol errors : 0
VL15 dropped packets: 0

TX:
Bytes : 5472
Packets : 19
Wait : 0
Discarded packets: 0

Hello,

I recommend running sminfo from the host side to confirm that the hosts can see the SM. Additionally, review opensm.log on the switch for more clues. Since the physical links are up, ibdiagnet won't provide much information; the focus should be on the opensm data. Look for any errors, Trap 128 entries, and MAD timeouts. MAD timeouts can cause the switch to be removed from the SM. Depending on the type of failure, disabling or replacing the modules and reloading the switch will prevent logical switch-down events.
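
If the opensm.log is large, a small filter can help surface the items mentioned above. This is only a rough Python sketch: the three regular expressions are assumptions derived from the wording of this reply (errors, Trap 128, timeouts), not a documented opensm log format, and the names `PATTERNS`, `scan_opensm_log` and `scan_file` are hypothetical. Adjust the patterns to whatever your opensm version actually writes.

```python
import re

# Assumed patterns; real opensm.log wording varies by version.
PATTERNS = {
    "error": re.compile(r"\bERR\b"),
    "trap_128": re.compile(r"\btrap\s*128\b", re.IGNORECASE),
    "mad_timeout": re.compile(r"timeout|timed out", re.IGNORECASE),
}


def scan_opensm_log(lines):
    """Group log lines by the pattern they match.

    `lines` is any iterable of strings (for example an open file).
    Returns {pattern_name: [(line_number, text), ...]}.
    """
    hits = {name: [] for name in PATTERNS}
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append((lineno, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as f:
        return scan_opensm_log(f)
```

Calling `scan_file` with the path to a copy of the log (the location depends on your switch or host setup) returns the matching lines with their line numbers, so the entries around the time the port went Down can be read in context.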

Lastly, feel free to open a support case if you have a valid entitlement.

Thank you and have a wonderful day

Nvidia Support