Hi everyone,
Our BlueField-3 and its host both have DOCA 2.2.0 installed.
I am configuring my BlueField-3 InfiniBand cards and setting them to “NIC mode” on an InfiniBand network with these commands:
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e s INTERNAL_CPU_OFFLOAD_ENGINE=1
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e s LINK_TYPE_P1=1 LINK_TYPE_P2=1
After a cold reboot of the machine, the host fails to bring up the InfiniBand network as expected. Both ports report a physical link, but their logical state stays DOWN. The ibstatus output is as follows:
xxx@bf3:/dev/mst$ ibstatus
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:a088:c203:0030:f0cc
base lid: 0xffff
sm lid: 0x0
state: 1: DOWN
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: InfiniBand
Infiniband device 'mlx5_1' port 1 status:
default gid: fe80:0000:0000:0000:a088:c203:0030:f0cd
base lid: 0xffff
sm lid: 0x0
state: 1: DOWN
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: InfiniBand
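To make the symptom easier to spot, here is a rough shell sketch that reads ibstatus-style text on stdin and lists every port whose physical state is LinkUp while its logical state is still DOWN. The function name find_stuck_ports is my own (hypothetical, not part of any NVIDIA/Mellanox tooling), and it assumes the field layout shown in the output above.

```shell
# Hypothetical helper (not part of MLNX_OFED or DOCA): reads ibstatus-style
# text on stdin and prints "<device> port <n>" for each port that is
# physically LinkUp but logically DOWN.
find_stuck_ports() {
    awk -v q="'" '
        /^Infiniband device/ {
            # Header line: device name is field 3 (quoted), port number is field 5.
            dev = $3
            gsub(q, "", dev)
            port = $5
            state = ""
        }
        # Logical state line; the "phys state:" line does not match this pattern.
        /^[ \t]*state:/ { state = $NF }
        # Physical state line follows the logical state line in ibstatus output.
        /^[ \t]*phys state:/ {
            if (state == "DOWN" && $NF == "LinkUp")
                print dev " port " port
        }
    '
}
```

On a live system this would be used as `ibstatus | find_stuck_ports`. With the output above I would expect it to list both mlx5_0 and mlx5_1.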
Then I checked the status and log of opensmd:
xxx@bf3:/dev/mst$ sudo systemctl status opensmd
● opensmd.service - OpenSM
Loaded: loaded (/lib/systemd/system/opensmd.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2023-09-13 11:46:57 UTC; 17min ago
Main PID: 6732 (opensm)
Tasks: 400 (limit: 153780)
Memory: 20.6M
CPU: 311ms
CGroup: /system.slice/opensmd.service
├─6732 /usr/sbin/opensm
└─6734 osm_crashd "" "" "" "" "" ""
Sep 13 11:46:57 ds01 OpenSM[6732]: /var/log/opensm.log log file opened
Sep 13 11:46:57 ds01 OpenSM[6732]: OpenSM 5.16.0.MLNX20230719.c143fc96
Sep 13 11:46:57 ds01 opensm[6732]: OpenSM 5.16.0.MLNX20230719.c143fc96
Sep 13 11:46:57 ds01 opensm[6732]: Using default GUID 0xa088c2030030f0cc
Sep 13 11:46:57 ds01 OpenSM[6732]: Entering DISCOVERING state
Sep 13 11:46:57 ds01 opensm[6732]: Entering DISCOVERING state
Sep 13 11:46:58 ds01 OpenSM[6732]: SM port is down
Sep 13 11:46:58 ds01 opensm[6732]: SM port is down
Sep 13 11:46:58 ds01 opensm[6732]: Check SM is configured to use a physical port
Sep 13 11:46:58 ds01 OpenSM[6732]: Check SM is configured to use a physical port
I suspect the "SM port is down" message is related to my error, but restarting the opensmd service does not help, and I have made sure that both the NIC ports and the cable are healthy (i.e. the physical link is up). Does anyone have any idea?