OpenSM run failed

OS : Ubuntu 24.04.1 (6.14.0-27-generic)

OpenSM : 5.23.00.MLNX20250423.ac516692

HCA : 900-9X81E-00EX-STO

I’m trying to bring up a back-to-back connection between two HCAs, each installed in a separate server.

ibstat shows the following:

Host#0

CA ‘mlx5_0’
CA type: MT4131
Number of ports: 1
Firmware version: 40.45.1200
Hardware version: 0
Node GUID: 0x5000e60300a4f6cc
System image GUID: 0x5000e60300a4f6cc
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 400
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0xa751ec48
Port GUID: 0x5000e60300a4f6cc
Link layer: InfiniBand

Host#1

CA ‘mlx5_0’
CA type: MT4131
Number of ports: 1
Firmware version: 40.45.1200
Hardware version: 0
Node GUID: 0x5000e60300b9f66a
System image GUID: 0x5000e60300b9f66a
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 400
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0xa751ec48
Port GUID: 0x5000e60300b9f66a
Link layer: InfiniBand
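
For readers comparing their own output: a port that reports Physical state LinkUp, State Initializing and SM lid 0 has a working physical link but has not yet been configured by any subnet manager. The snippet below is a minimal sketch that spots that pattern in ibstat-style text. The function names `parse_ibstat` and `ports_waiting_for_sm` are hypothetical helpers made up for illustration, and the sample is trimmed from the Host#0 output above with plain ASCII quotes.

```python
import re

# Trimmed from the Host#0 output in this thread (ASCII quotes assumed).
SAMPLE = """CA 'mlx5_0'
    CA type: MT4131
    Number of ports: 1
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 400
        Base lid: 65535
        SM lid: 0
        Port GUID: 0x5000e60300a4f6cc
        Link layer: InfiniBand
"""

def parse_ibstat(text):
    """Parse ibstat-style text into {ca_name: {port_number: {field: value}}}."""
    cas = {}
    ca = port = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA\s+\W(\w+)\W$", line)   # e.g. CA 'mlx5_0'
        if m:
            ca, port = m.group(1), None
            cas[ca] = {}
            continue
        m = re.match(r"Port\s+(\d+):$", line)    # e.g. Port 1:
        if m and ca is not None:
            port = int(m.group(1))
            cas[ca][port] = {}
            continue
        if port is not None and ":" in line:     # per-port "Field: value" lines
            key, _, value = line.partition(":")
            cas[ca][port][key.strip()] = value.strip()
    return cas

def ports_waiting_for_sm(cas):
    """Return (ca, port) pairs that are physically up but not configured by an SM."""
    waiting = []
    for ca, ports in cas.items():
        for num, fields in ports.items():
            if (fields.get("Physical state") == "LinkUp"
                    and fields.get("State") == "Initializing"
                    and fields.get("SM lid") == "0"):
                waiting.append((ca, num))
    return waiting
```

Calling `ports_waiting_for_sm(parse_ibstat(SAMPLE))` reports port 1 of mlx5_0, which matches the symptom described here: the cable and link are fine, and the remaining problem is getting a subnet manager to run.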

The link state is stuck in Initializing on both hosts, and when I run opensm on one of the hosts, it fails with the error below:
OpenSM 5.23.00.MLNX20250423.ac516692
Using default GUID 0x5000e60300a4f6cc
Entering DISCOVERING state
Error from osm_opensm_bind (0x2A)
Perhaps another instance of OpenSM is already running
Exiting SM

I have searched this forum and Google but couldn’t find a solution.
Has anyone seen a similar issue or found a fix?

Hi,

Let’s try the following on Host#0:

  1. killall opensm >> stop the SM service.
  2. opensm -B >> start the SM service. Do you get an error?
  3. Does that fix the issue? If not, continue.
  4. systemctl status opensmd.service >> if it is active, stop it.
  5. sminfo >> is an SM running? Does it show which node is running it (LID and GUID)?
  6. If it still doesn’t work, try enabling the SM on Host#1. Does the SM work there, or is it the same issue?
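
The command steps above can be collected into one pass so that all the output is captured together for posting. This is only a rough sketch: `STEPS` and `run_steps` are hypothetical names, the commands are exactly the ones listed above, and it assumes the script runs as root on a host where those tools are installed.

```python
import subprocess

# Commands taken from the steps above; labels are only for readability.
STEPS = [
    ("stop any running SM", ["killall", "opensm"]),
    ("start the SM in the background", ["opensm", "-B"]),
    ("check the opensmd service", ["systemctl", "status", "opensmd.service"]),
    ("query the active SM", ["sminfo"]),
]

def run_steps(steps=STEPS, runner=subprocess.run):
    """Run each command in order and return (label, returncode, output) tuples."""
    results = []
    for label, cmd in steps:
        try:
            proc = runner(cmd, capture_output=True, text=True)
            output = (proc.stdout + proc.stderr).strip()
            results.append((label, proc.returncode, output))
        except FileNotFoundError:
            # The tool is not installed or not on PATH.
            results.append((label, None, "command not found: " + cmd[0]))
    return results

if __name__ == "__main__":
    for label, code, output in run_steps():
        print(f"== {label} (exit code: {code})")
        print(output)
```

The `runner` parameter exists only so the sequence can be exercised without touching a real fabric. On a real host the default `subprocess.run` is used.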

Thanks.

2. opensm -B >> start the SM service. Do you get an error?
I got the error below in /var/log/opensm.log:
Aug 18 09:23:34 154404 [234DE740] 0x03 → OpenSM 5.23.00.MLNX20250423.ac516692
OpenSM 5.23.00.MLNX20250423.ac516692

Aug 18 09:23:34 154485 [234DE740] 0x80 → OpenSM 5.23.00.MLNX20250423.ac516692
Aug 18 09:23:34 156855 [234DE740] 0x02 → osm_vendor_init: 1000 pending umads specified
Aug 18 09:23:34 156944 [234DE740] 0x02 → osm_vendor_init: 1000 pending umads specified
Aug 18 09:23:34 157004 [234DE740] 0x02 → osm_vendor_init: 1000 pending umads specified
Using default GUID 0x5000e60300b9f66a
Aug 18 09:23:34 173338 [234DE740] 0x02 → osm_tenant_mgr_init: tenant mgr is disabled
Aug 18 09:23:34 173479 [234DE740] 0x02 → osm_issu_mgr_init: issu_mgr is initialized
Entering DISCOVERING state

Aug 18 09:23:34 173509 [234DE740] 0x80 → Entering DISCOVERING state
Aug 18 09:23:34 173837 [234DE740] 0x02 → osm_vendor_rebind: Mgmt class 0x81 binding to port GUID 0x5000e60300b9f66a
Aug 18 09:23:34 191363 [234DE740] 0x01 → osm_vendor_rebind: ERR 5424: Unable to open port 0x5000e60300b9f66a
Aug 18 09:23:34 191374 [234DE740] 0x01 → osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
Aug 18 09:23:34 191385 [234DE740] 0x01 → osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR) for port guid 0x5000e60300b9f66a, port index 0

Error from osm_opensm_bind (0x2A)
Perhaps another instance of OpenSM is already running
Aug 18 09:23:34 193236 [234DE740] 0x02 → osm_tenant_mgr_destroy: osm_tenant_mgr_destroy complete
Aug 18 09:23:34 193250 [234DE740] 0x02 → osm_issu_mgr_destroy: osm_issu_mgr_destroy complete
Exiting SM
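
Most of the log above is informational. The useful part is the chain of `ERR` lines, which reads bottom-up from the generic bind failure to the first failure, "Unable to open port". A minimal sketch for pulling just those lines out of a log follows. `extract_errors` is a hypothetical helper, and the pattern is assumed from the three ERR lines shown in this thread rather than taken from any documented log format.

```python
import re

# Matches "<function>: ERR <4 hex digits>: <message>", as seen in the log above.
ERR_RE = re.compile(r"(?P<func>\w+): ERR (?P<code>[0-9A-Fa-f]{4}): (?P<msg>.*)$")

def extract_errors(log_text):
    """Return (function, error_code, message) for every ERR line, in log order."""
    errors = []
    for line in log_text.splitlines():
        m = ERR_RE.search(line)
        if m:
            errors.append((m.group("func"), m.group("code"), m.group("msg").strip()))
    return errors

# Three lines copied from the log above, plus one informational line.
SAMPLE_LOG = """\
Aug 18 09:23:34 173837 [234DE740] 0x02 -> osm_vendor_rebind: Mgmt class 0x81 binding to port GUID 0x5000e60300b9f66a
Aug 18 09:23:34 191363 [234DE740] 0x01 -> osm_vendor_rebind: ERR 5424: Unable to open port 0x5000e60300b9f66a
Aug 18 09:23:34 191374 [234DE740] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
Aug 18 09:23:34 191385 [234DE740] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR) for port guid 0x5000e60300b9f66a, port index 0
"""

if __name__ == "__main__":
    for func, code, msg in extract_errors(SAMPLE_LOG):
        print(code, func, msg)
```

The first entry returned is the earliest failure, which is usually the one worth investigating. Here that is ERR 5424 from `osm_vendor_rebind`.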

4. systemctl status opensmd.service >> if it is active, stop it.
○ opensmd.service - OpenSM
Loaded: loaded (/usr/lib/systemd/system/opensmd.service; disabled; preset: enabled)
Active: inactive (dead)

5. sminfo >> is an SM running? Does it show which node is running it (LID and GUID)?
ibwarn: [13873] get_smi_gsi_pair: Can’t open UMAD port (No such device) ((null):0)
ibwarn: [13873] mad_rpc_open_port2: can’t open UMAD port ((null):0)
sminfo: iberror: failed: Failed to open ‘(null)’ port ‘0’
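
A possible follow-up check, not confirmed anywhere in this thread: the "Can’t open UMAD port (No such device)" message may mean the userspace MAD device nodes are missing. The standard tools normally reach the HCA through `/dev/infiniband/umadN`, which is provided by the `ib_umad` kernel module. The sketch below only looks for those nodes and for the module in `/proc/modules`. The helper names `list_umad_nodes` and `module_loaded` are hypothetical, and whether a missing node is the actual cause here is an assumption.

```python
import glob
import os

def list_umad_nodes(dev_dir="/dev/infiniband"):
    """Return the umad device nodes found under dev_dir (empty list if none)."""
    return sorted(glob.glob(os.path.join(dev_dir, "umad*")))

def module_loaded(name, proc_modules="/proc/modules"):
    """Return True if the named kernel module appears in /proc/modules."""
    try:
        with open(proc_modules) as fh:
            # The first whitespace-separated field of each line is the module name.
            return any(line.split()[0] == name for line in fh if line.strip())
    except OSError:
        return False

if __name__ == "__main__":
    nodes = list_umad_nodes()
    print("umad device nodes:", nodes if nodes else "none found")
    print("ib_umad loaded:", module_loaded("ib_umad"))
```

If no umad node is listed, that would be consistent with both the sminfo failure and the "Unable to open port" error from opensm, and it would be worth reporting in the thread along with the output above.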

I tried this on both hosts and it still doesn’t work.

Thanks,
Glen

Hi, try reinstalling OFED on one of the hosts and see if that resolves the issue: Linux InfiniBand Drivers