I’m trying to bring up a back-to-back connection between two HCAs, each installed in a separate server.
ibstat shows the following:
Host#0
CA 'mlx5_0'
CA type: MT4131
Number of ports: 1
Firmware version: 40.45.1200
Hardware version: 0
Node GUID: 0x5000e60300a4f6cc
System image GUID: 0x5000e60300a4f6cc
Port 1: State: Initializing
Physical state: LinkUp
Rate: 400
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0xa751ec48
Port GUID: 0x5000e60300a4f6cc
Link layer: InfiniBand
Host#1
CA 'mlx5_0'
CA type: MT4131
Number of ports: 1
Firmware version: 40.45.1200
Hardware version: 0
Node GUID: 0x5000e60300b9f66a
System image GUID: 0x5000e60300b9f66a
Port 1: State: Initializing
Physical state: LinkUp
Rate: 400
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0xa751ec48
Port GUID: 0x5000e60300b9f66a
Link layer: InfiniBand
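For reference, the fields that matter in the output above are State, Physical state, Base lid and SM lid. A port that is LinkUp but Initializing, with Base lid 65535 and SM lid 0, has trained its link but has not yet been configured by a subnet manager. Below is a minimal sketch that pulls those fields out of ibstat output so both hosts can be compared quickly. The function name summarize_ibstat and the "no-SM-yet" label are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Summarise logical port state from `ibstat` output read on stdin.
# Hypothetical helper; prints one line per port:
#   <ca> <state> <physical state> lid=<base lid> smlid=<sm lid> <verdict>
summarize_ibstat() {
  awk '
    # "CA <name>" starts a new adapter; skip the "CA type:" line.
    $1 == "CA" && $2 != "type:" { ca = $2; gsub(/[^A-Za-z0-9_]/, "", ca) }
    # "State:" (capital S) is the logical state; "Physical state:" is separate.
    /State: /          { state = $NF }
    /Physical state: / { phys = $NF }
    /Base lid: /       { lid = $NF }
    /SM lid: / {
      # Initializing with SM lid 0 means no subnet manager has set up the port.
      verdict = (state == "Initializing" && $NF == 0) ? "no-SM-yet" : "-"
      print ca, state, phys, "lid=" lid, "smlid=" $NF, verdict
    }
  '
}

# Usage on a host with the adapter installed:
#   ibstat | summarize_ibstat
```

On either host above this would report the port as no-SM-yet, which matches the symptom: the physical link is fine and the problem is on the subnet manager side.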
The link state is stuck in Initializing, and when I run opensm on one of the hosts, it fails with the error below:
OpenSM 5.23.00.MLNX20250423.ac516692
Using default GUID 0x5000e60300a4f6cc
Entering DISCOVERING state
Error from osm_opensm_bind (0x2A)
Perhaps another instance of OpenSM is already running
Exiting SM
I have searched this forum and Google, but I couldn’t find a solution that applies to my case.
Has anyone seen a similar issue, or does anyone know of a fix?
2. opensm -B >> start the SM service. Do you get an error?
I got the error below in /var/log/opensm.log:
Aug 18 09:23:34 154404 [234DE740] 0x03 → OpenSM 5.23.00.MLNX20250423.ac516692
OpenSM 5.23.00.MLNX20250423.ac516692
Aug 18 09:23:34 154485 [234DE740] 0x80 → OpenSM 5.23.00.MLNX20250423.ac516692
Aug 18 09:23:34 156855 [234DE740] 0x02 → osm_vendor_init: 1000 pending umads specified
Aug 18 09:23:34 156944 [234DE740] 0x02 → osm_vendor_init: 1000 pending umads specified
Aug 18 09:23:34 157004 [234DE740] 0x02 → osm_vendor_init: 1000 pending umads specified
Using default GUID 0x5000e60300b9f66a
Aug 18 09:23:34 173338 [234DE740] 0x02 → osm_tenant_mgr_init: tenant mgr is disabled
Aug 18 09:23:34 173479 [234DE740] 0x02 → osm_issu_mgr_init: issu_mgr is initialized
Entering DISCOVERING state
Aug 18 09:23:34 173509 [234DE740] 0x80 → Entering DISCOVERING state
Aug 18 09:23:34 173837 [234DE740] 0x02 → osm_vendor_rebind: Mgmt class 0x81 binding to port GUID 0x5000e60300b9f66a
Aug 18 09:23:34 191363 [234DE740] 0x01 → osm_vendor_rebind: ERR 5424: Unable to open port 0x5000e60300b9f66a
Aug 18 09:23:34 191374 [234DE740] 0x01 → osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
Aug 18 09:23:34 191385 [234DE740] 0x01 → osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR) for port guid 0x5000e60300b9f66a, port index 0
Error from osm_opensm_bind (0x2A)
Perhaps another instance of OpenSM is already running
Aug 18 09:23:34 193236 [234DE740] 0x02 → osm_tenant_mgr_destroy: osm_tenant_mgr_destroy complete
Aug 18 09:23:34 193250 [234DE740] 0x02 → osm_issu_mgr_destroy: osm_issu_mgr_destroy complete
Exiting SM
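In the log above the three ERR lines form a chain: ERR 5424 (unable to open the port) comes first, and ERR 3118 and ERR 2E10 look like consequences of it, so the "another instance" hint printed afterwards may simply be a generic message. The sketch below pulls just the ERR entries out of an opensm log so the first one stands out. The function name opensm_errors is hypothetical, and the pattern assumes the "function: ERR code: message" layout seen in this log.

```shell
#!/bin/sh
# Print "<code> <function> <message>" for every ERR line in an
# opensm log read on stdin. Hypothetical helper for illustration.
opensm_errors() {
  sed -n 's/.* \([A-Za-z0-9_]*\): ERR \([0-9A-F][0-9A-F]*\): \(.*\)/\2 \1 \3/p'
}

# Usage (log path as given in this thread):
#   opensm_errors < /var/log/opensm.log | head -n 1   # first error only
```

The first line printed is usually the one worth investigating; the later ones tend to be the same failure reported by the callers.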
4. systemctl status opensmd.service >> is it active? If so, stop it.
○ opensmd.service - OpenSM
Loaded: loaded (/usr/lib/systemd/system/opensmd.service; disabled; preset: enabled)
Active: inactive (dead)
5. sminfo >> is the SM running, and does it show who is running it (LID and GUID)?
ibwarn: [13873] get_smi_gsi_pair: Can't open UMAD port (No such device) ((null):0)
ibwarn: [13873] mad_rpc_open_port2: can't open UMAD port ((null):0)
sminfo: iberror: failed: Failed to open '(null)' port '0'
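Both sminfo and opensm fail at the point where they open a user-space MAD (umad) port, while ibstat still works. One thing worth checking, as an assumption rather than a confirmed diagnosis, is whether the umad device nodes exist at all. They normally appear as /dev/infiniband/umadN once the ib_umad kernel module is loaded. A minimal sketch of that check follows; the function name check_umad is hypothetical.

```shell
#!/bin/sh
# Report whether any umad device nodes exist in the given directory
# (default: /dev/infiniband). Hypothetical helper for illustration.
check_umad() {
  dir="${1:-/dev/infiniband}"
  if ls "$dir"/umad* >/dev/null 2>&1; then
    echo "umad devices present in $dir"
  else
    echo "no umad devices in $dir"
    return 1
  fi
}

# Usage on each host:
#   check_umad
# If nothing is found, these are the usual follow-ups (run as root):
#   lsmod | grep ib_umad
#   modprobe ib_umad
```

If the nodes are missing on a host, neither opensm nor sminfo can send MADs from it, which would be consistent with the "No such device" and "Unable to open port" messages above.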