ConnectX-8 back-to-back InfiniBand link stuck in Initializing, UMAD cannot open port, OpenSM fails to bind

Hi,

I am trying to connect two ConnectX-8 cards back to back between two different servers. The link comes up and negotiates the speed, but the port state never leaves Initializing.

Here is the link info. I am using an InfiniBand-specific cable (OSFPFL-400G-PC01):
sudo mlxlink -d /dev/mst/mt4131_pciconf0

Operational Info

State : Active
Physical state : N/A
Speed : IB-NDR
Width : 4x
FEC : Interleaved_Standard_RS_FEC_PLR - (544,514)
Loopback Mode : No Loopback
Auto Negotiation : ON

Supported Info

Enabled Link Speed : 0x000000c1 (NDR,HDR,SDR)
Supported Cable Speed : 0x000000f1 (NDR,HDR,EDR,FDR,SDR)

Troubleshooting Info

Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed

Tool Information

Firmware Version : 40.47.1088
amBER Version : 5.75
MFT Version : 4.34.1-10


here is ibstat:

CA 'mlx5_0'
CA type: MT4131
Number of ports: 1
Firmware version: 40.47.1088
Hardware version: 0
Node GUID: 0xXXXXXXXXde4
System image GUID: 0xXXXXXXXde4
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 400
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0XXXXXc48
Port GUID: 0xXXXXXXXde4
Link layer: InfiniBand

ibstat of the other card:

CA 'mlx5_0'
CA type: MT4131
Number of ports: 1
Firmware version: 40.47.1088
Hardware version: 0
Node GUID: 0xXXXXXXX76a
System image GUID: 0xXXXXXXX76a
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 400
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0xXXXXXXc48
Port GUID: 0xXXXXXXX76a
Link layer: InfiniBand

When I try to start OpenSM:

Feb 04 19:09:33 529433 [C74A0740] 0x03 -> OpenSM 5.25.1.MLNX20251030.e3791a47
Feb 04 19:09:33 529491 [C74A0740] 0x80 -> OpenSM 5.25.1.MLNX20251030.e3791a47
Feb 04 19:09:33 535886 [C74A0740] 0x02 -> osm_vendor_init: 1000 pending umads specified
Feb 04 19:09:33 535981 [C74A0740] 0x02 -> osm_vendor_init: 1000 pending umads specified
Feb 04 19:09:33 536040 [C74A0740] 0x02 -> osm_vendor_init: 1000 pending umads specified
Feb 04 19:09:33 554883 [C74A0740] 0x02 -> osm_tenant_mgr_init: tenant mgr is disabled
Feb 04 19:09:33 555039 [C74A0740] 0x80 -> Entering DISCOVERING state
Feb 04 19:09:33 555201 [C74A0740] 0x02 -> osm_issu_mgr_init: issu_mgr is initialized
Feb 04 19:09:33 555421 [C74A0740] 0x02 -> osm_vendor_rebind: Mgmt class 0x81 binding to port GUID 0x90e3170300f0bde4
Feb 04 19:09:33 566448 [C74A0740] 0x01 -> osm_vendor_rebind: ERR 5424: Unable to open port 0x90e3170300f0bde4
Feb 04 19:09:33 566466 [C74A0740] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
Feb 04 19:09:33 566473 [C74A0740] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR) for port guid 0xXXXXXXXde4, port index 0
Feb 04 19:09:33 572124 [C74A0740] 0x02 -> osm_tenant_mgr_destroy: osm_tenant_mgr_destroy complete
Feb 04 19:09:33 572153 [C74A0740] 0x02 -> osm_issu_mgr_destroy: osm_issu_mgr_destroy complete
Feb 04 19:09:33 572245 [C74A0740] 0x80 -> Exiting SM

I have also tried different configurations (configs 1, 2, and 5) from the ConnectX-8 SuperNIC user manual: https://docs.nvidia.com/networking/display/nvidia-connectx-8-supernic-user-manual.pdf

Hi turbogarrett8,

Welcome, and thanks for posting your inquiry to the NVIDIA Developer Forums!

With XDR, an additional step is needed: an SMI device that supports multi-plane must be created on the desired RDMA interface.

For example, if you want the SM to be able to bind to mlx5_0:

/opt/mellanox/iproute2/sbin/rdma dev add smi2 type SMI parent mlx5_0
root@xdr1-b11-u07:~# rdma dev show
0: mlx5_0: node_type ca protocol ib fw 40.44.1036 node_guid 5000:e603:0005:6f0a sys_image_guid 5000:e603:0005:6f0a
root@xdr1-b11-u07:~# rdma dev add smi2 type SMI parent mlx5_0
root@xdr1-b11-u07:~# rdma dev show
0: mlx5_0: node_type ca protocol ib fw 40.44.1036 node_guid 5000:e603:0005:6f0a sys_image_guid 5000:e603:0005:6f0a
2: smi2: node_type ca fw 40.44.1036 node_guid 5000:e603:0005:6f0a sys_image_guid 5000:e603:0005:6f0a type smi parent mlx5_0 

Once this is created, OpenSM will be able to bind to the port.
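Putting the steps above together, the whole sequence might look like the sketch below. The device names are assumptions taken from the example (mlx5_0 as the parent RDMA device, smi2 as an arbitrary SMI device name); run this on the node that will host the SM:

```shell
# Sketch, assuming: parent RDMA device mlx5_0, arbitrary SMI device name smi2.
PARENT=mlx5_0
SMI_NAME=smi2

# 1. Create the SMI device for the port OpenSM should manage.
/opt/mellanox/iproute2/sbin/rdma dev add "$SMI_NAME" type SMI parent "$PARENT"

# 2. Confirm it exists: the output should include a
#    "smi2 ... type smi parent mlx5_0" line, as in the transcript above.
rdma dev show

# 3. Start OpenSM as a daemon; with the SMI device in place it can bind to the port.
opensm -B

# 4. After the SM sweeps, ibstat on either node should show the port
#    moving from Initializing to Active, with a real Base lid assigned.
ibstat "$PARENT" 1
```

On the second (non-SM) node nothing extra is needed; once the SM assigns LIDs, its port should also go Active.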

Note: The default behavior does not change. OpenSM will still attempt to bind to the first available RDMA device unless otherwise specified (via the guid setting in opensm.conf).
If that device does not have an SMI device created for it, the same failure will occur.
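For reference, steering OpenSM to a specific port might look like the sketch below; the GUID value is a placeholder for the Port GUID reported by ibstat on your system, not a real value:

```
# opensm.conf fragment (sketch): bind the SM to a specific port
# instead of the first available RDMA device.
guid 0x<port-guid-from-ibstat>
```

The same can be done on the command line with opensm's -g/--guid option.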

Best regards,
NVIDIA Enterprise Experience

This was the solution! Thank you!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.