opensm failure after reboot, stuck on port initialization

Specs:

Linux Kernel: 4.15.0-140-generic

OS: Ubuntu 18.04

MLNX_OFED_LINUX-4.9-2.2.4.0-ubuntu18.04-x86_64

Everything was working until a reboot. I tried several things; the results are below. The port GUIDs also changed, so I had to manually update /etc/opensm/opensm.conf. (NOTE: opensm.conf is the default template; my only modification was specifying the port IDs.) I am looking at the OFED manual now for further diagnostics.
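When comparing a port GUID against the value in opensm.conf, note that the logs below print the same GUID both unpadded (0x2c903003fc581) and zero-padded to 16 hex digits (0x0002c903003fc581). A minimal sketch for normalizing a GUID before comparing; the function name `normalize_guid` is hypothetical and not part of any OFED tool:

```shell
#!/usr/bin/env bash
# Sketch: normalize a port GUID to 0x + 16 lowercase hex digits so that
# padded and unpadded forms of the same GUID compare equal.
# normalize_guid is an illustrative helper, not an OFED command.
normalize_guid() {
    local guid="$1"
    guid="${guid#0x}"                               # drop a leading 0x
    guid="${guid#0X}"                               # ... or 0X
    guid="$(printf '%s' "$guid" | tr 'A-F' 'a-f')"  # lowercase the hex digits
    while [ "${#guid}" -lt 16 ]; do                 # left-pad with zeros
        guid="0$guid"
    done
    printf '0x%s\n' "$guid"
}
```

For example, `normalize_guid 0x2c903003fc581` prints the padded form. The GUIDs of the local ports can be listed with `ibstat -p` and compared against the `guid` entry in opensm.conf.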

— Report —

sudo modprobe ib_umad (worked?)

sudo modprobe xprtrdma

modprobe: ERROR: could not insert 'rpcrdma': Unknown symbol in module, or unknown parameter (see dmesg)

dmesg | tail

rpcrdma: Unknown symbol ib_alloc_cq (err 0)

rpcrdma: Unknown symbol ib_dereg_mr (err 0)

rpcrdma: Unknown symbol rdma_create_id (err 0)

rpcrdma: Unknown symbol ib_alloc_mr (err 0)

rpcrdma: Unknown symbol ib_free_cq (err 0)

rpcrdma: Unknown symbol rdma_accept (err 0)

rpcrdma: Unknown symbol ib_destroy_qp (err 0)

rpcrdma: Unknown symbol ib_dealloc_pd (err 0)

sminfo

ibwarn: [18194] mad_rpc_open_port: can't open UMAD port ((null):0)

sminfo: iberror: failed: Failed to open '(null)' port '0'

NOTE: no rdma service installed

sudo osmtest -f c (same output for -f a, except ‘validation’ instead of ‘inventory’)

Command Line Arguments

Done with args

Flow = Create Inventory

Apr 07 10:31:23 592167 [2110F740] 0x7f → Setting log level to: 0x03

Apr 07 10:31:23 592367 [2110F740] 0x02 → osm_vendor_init: 1000 pending umads specified

Apr 07 10:31:23 661108 [2110F740] 0x02 → osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x2c903003fc582

Apr 07 10:31:23 745819 [1F6B1700] 0x01 → __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x000C

Apr 07 10:31:23 745869 [1F6B1700] 0x01 → osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR)

Apr 07 10:31:23 745955 [2110F740] 0x01 → osmtest_validate_sa_class_port_info: ERR 0070: ib_query failed (IB_REMOTE_ERROR)

Apr 07 10:31:23 745993 [2110F740] 0x01 → osmtest_validate_sa_class_port_info: Remote error = IB_MAD_STATUS_UNSUP_METHOD_ATTR

Apr 07 10:31:23 746013 [2110F740] 0x01 → osmtest_run: ERR 0138: Could not obtain SA ClassPortInfo (IB_REMOTE_ERROR)

OSMTEST: TEST “Create Inventory” FAIL

sudo systemctl restart opensm, output of /var/log/opensm.log

Apr 07 10:24:02 117367 [8BF43740] 0x80 → Exiting SM

Apr 07 10:26:07 072815 [E3C3D740] 0x03 → OpenSM 5.7.2.MLNX20201014.9378048

OpenSM 5.7.2.MLNX20201014.9378048

Apr 07 10:26:07 072926 [E3C3D740] 0x80 → OpenSM 5.7.2.MLNX20201014.9378048

Apr 07 10:26:07 077131 [E3C3D740] 0x02 → osm_vendor_init: 1000 pending umads specified

Apr 07 10:26:07 077241 [E3C3D740] 0x02 → osm_vendor_init: 1000 pending umads specified

Apr 07 10:26:07 077354 [E3C3D740] 0x02 → osm_vendor_init: 1000 pending umads specified

Entering DISCOVERING state

Apr 07 10:26:07 080343 [E3C3D740] 0x80 → Entering DISCOVERING state

Apr 07 10:26:07 080556 [E3C3D740] 0x02 → osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x2c903003fc581

Apr 07 10:26:07 171455 [E3C3D740] 0x02 → osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x2c903003fc582

Apr 07 10:26:07 257881 [E3C3D740] 0x02 → osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x2c903003fc581

Apr 07 10:26:07 344319 [E3C3D740] 0x02 → osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x2c903003fc581

Apr 07 10:26:07 344394 [E3C3D740] 0x02 → osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x2c903003fc581

Apr 07 10:26:07 344453 [E3C3D740] 0x02 → osm_opensm_bind: Setting IS_SM on port 0x0002c903003fc581

SM port is down

sudo hca_self_test.ofed

---- Performing Adapter Device Self Test ----

Number of CAs Detected … 1

PCI Device Check … PASS

Kernel Arch … x86_64

Host Driver Version … MLNX_OFED_LINUX-4.9-2.2.4.0 (OFED-4.9-2.2.4): 4.15.0-140-generic

Host Driver RPM Check … PASS

Firmware on CA #0 VPI … v2.42.5000

Host Driver Initialization … PASS

Number of CA Ports Active … 0

Error Counter Check on CA #0 (VPI)… PASS

Kernel Syslog Check … PASS

Node GUID on CA #0 (VPI) … NA

------------------ DONE ---------------------

Hello Willy,

Thank you for posting your inquiry on the NVIDIA Networking Community.

Based on the information provided, it appears your kernel was updated after MLNX_OFED 4.9 GA was installed.

The default kernel version of Ubuntu 18.04 is 4.15.0.20.23. Your kernel version is -140. When the kernel is updated after the driver is installed, you need to reinstall the driver to have it rebuild against the new kernel.

We did an install in our lab with kernel -140, and we were able to successfully install and run the driver.

Our recommendation is to reinstall the driver, which will resolve this issue.
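One way to check for this kind of mismatch before reinstalling is to compare the running kernel release with the `vermagic` string of the RDMA modules. The following is a minimal sketch, not an official procedure: the function names are hypothetical, and it only detects modules built for a different kernel release, not symbol-version mismatches between in-box and OFED modules.

```shell
#!/usr/bin/env bash
# Sketch: check whether a kernel module was built for a given kernel release.
# A vermagic string looks like "4.15.0-140-generic SMP mod_unload"; its first
# whitespace-separated field is the release the module was built for.
# Both function names are illustrative, not part of MLNX_OFED.
vermagic_matches_kernel() {
    local kernel_release="$1"
    local vermagic="$2"
    local built_for="${vermagic%% *}"   # text before the first space
    [ "$built_for" = "$kernel_release" ]
}

# Report on one module of the live system, e.g. check_module ib_core
check_module() {
    local module="$1"
    local vermagic
    vermagic="$(modinfo -F vermagic "$module" 2>/dev/null)" || {
        echo "$module: not found by modinfo"
        return 2
    }
    if vermagic_matches_kernel "$(uname -r)" "$vermagic"; then
        echo "$module: built for the running kernel"
    else
        echo "$module: built for ${vermagic%% *}, running $(uname -r)"
        return 1
    fi
}
```

Running `check_module ib_core` and `check_module rpcrdma` after sourcing the functions shows which release each module targets; `modinfo -F filename ib_core` additionally shows whether the module found comes from the distribution kernel tree or from the driver package.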

Thank you and regards,

~NVIDIA Networking Technical Support

Unfortunately, I am still stuck.

sudo sminfo

ibwarn: [20373] _do_madrpc: recv failed: Connection timed out

ibwarn: [20373] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0)

sminfo: iberror: failed: query

sudo osmtest

Done with args

Flow = All Validations

Apr 08 11:12:02 903852 [3A15E740] 0x7f → Setting log level to: 0x03

Apr 08 11:12:02 903970 [3A15E740] 0x02 → osm_vendor_init: 1000 pending umads specified

Apr 08 11:12:02 969024 [3A15E740] 0x02 → osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x2c903003fc582

Apr 08 11:12:03 048683 [38700700] 0x01 → __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x000C

Apr 08 11:12:03 048732 [38700700] 0x01 → osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR)

Apr 08 11:12:03 048828 [3A15E740] 0x01 → osmtest_validate_sa_class_port_info: ERR 0070: ib_query failed (IB_REMOTE_ERROR)

Apr 08 11:12:03 048879 [3A15E740] 0x01 → osmtest_validate_sa_class_port_info: Remote error = IB_MAD_STATUS_UNSUP_METHOD_ATTR

Apr 08 11:12:03 048898 [3A15E740] 0x01 → osmtest_run: ERR 0138: Could not obtain SA ClassPortInfo (IB_REMOTE_ERROR)

OSMTEST: TEST “All Validations” FAIL

sudo ibqueryerrors

ibwarn: [31162] sa_get_handle: No SM/SA found on port (null):0

UPDATE:

Working. It seems to be a finicky system. I rebooted, did some service restarts and plugged the cable into the alternative port on one card. For some reason the physical port was not responding on reboot.

That was the second time the physical connection did not initiate without re-inserting the transceiver.
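For anyone hitting the same symptom, the link state can be read directly from sysfs without a running SM, which helps separate a physical link problem from an opensm problem. A minimal sketch, assuming the standard /sys/class/infiniband layout; the function name `list_ib_port_states` is hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: print logical and physical state for every InfiniBand port.
# Reads <root>/<ca>/ports/<n>/state and phys_state, whose contents look
# like "1: DOWN" or "5: LinkUp". The optional argument overrides the
# sysfs root (useful for testing). list_ib_port_states is an illustrative
# name, not an OFED command.
list_ib_port_states() {
    local root="${1:-/sys/class/infiniband}"
    local state_file port_dir ca port state phys
    for state_file in "$root"/*/ports/*/state; do
        [ -r "$state_file" ] || continue          # skip if nothing matched
        port_dir="$(dirname "$state_file")"
        port="$(basename "$port_dir")"
        ca="$(basename "$(dirname "$(dirname "$port_dir")")")"
        state="$(cat "$state_file")"
        phys="unknown"
        if [ -r "$port_dir/phys_state" ]; then
            phys="$(cat "$port_dir/phys_state")"
        fi
        # Strip the numeric prefix ("4: ACTIVE" -> "ACTIVE")
        echo "$ca port $port: state=${state#*: } phys_state=${phys#*: }"
    done
}
```

A port whose physical state stays at Polling or Disabled has no link to its peer, which points at the cable, transceiver, or remote port rather than at the subnet manager.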

Let’s see how it goes over the next while.

Thanks for your help. Should I close these tickets in some way?