Hi all, I recently tried to switch a BlueField-2 InfiniBand DPU from Separated Host mode to ECPF mode and ran into trouble with OVS startup. My steps were as follows.
I set INTERNAL_CPU_MODEL to 1 as described in the DPU OS 3.9.0 documentation (Modes of Operation - BlueField DPU OS 3.9.0 - NVIDIA Networking Docs), and checked the configuration in /etc/mellanox/mlnx-ovs.conf:
CREATE_OVS_BRIDGES="yes"
OVS_BRIDGE1="ovsbr1"
OVS_BRIDGE1_PORTS="p0 pf0hpf en3f0pf0sf0"
OVS_BRIDGE2="ovsbr2"
OVS_BRIDGE2_PORTS="p1 pf1hpf en3f1pf1sf0"
OVS_HW_OFFLOAD="yes"
OVS_START_TIMEOUT=30
Everything looked fine, so I power cycled the server. Afterwards, I logged in to the DPU and checked with
sudo ovs-vsctl show
and the result was as follows.
0ffc4fa4-fb7a-4e27-afef-4b6d80cd808f
    ovs_version: "2.15.1-d246dab"
It was empty: no bridges had been created.
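One thing worth ruling out at this point is whether the mode change is actually what the firmware reports. Below is a minimal sketch, assuming the MFT tools (mst, mlxconfig) are installed and that the device path is /dev/mst/mt41686_pciconf0 (an assumption; check mst status for the real one). The helper name cpu_model is my own, and it simply pulls the INTERNAL_CPU_MODEL field out of the mlxconfig query output.

```shell
#!/bin/sh
# Hypothetical helper: read `mlxconfig -d <dev> q` output on stdin and
# print the value shown next to INTERNAL_CPU_MODEL.
cpu_model() {
    awk '$1 == "INTERNAL_CPU_MODEL" { print $2 }'
}

# Usage on real hardware (the device path is an assumption):
#   mst start
#   mlxconfig -d /dev/mst/mt41686_pciconf0 q | cpu_model
```

If I read the docs correctly, ECPF mode should show up here as EMBEDDED_CPU(1). Note that mlxconfig changes only take effect after a power cycle.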
Then I tried reinstalling the DPU OS from a BFB image. My package is DOCA_1.3.0_BSP_3.9.0_Ubuntu_20.04-6.signed. To match the version on the DPU, I also reinstalled OFED (184.108.40.206.3) and the BlueField driver on the host. I checked /etc/mellanox/mlnx-ovs.conf before power cycling; its contents were the same as above. I then power cycled the machine.
Nothing changed with OVS: the bridges were still not created.
I went through the "Default Ports and OVS Configuration" section of the documentation (BlueField DPU OS 3.9.0 - Deploying DPU OS Using BFB from Host) and checked the contents of /etc/modprobe.d/mlnx-bf.conf:
install ib_umad /sbin/modprobe --ignore-install ib_umad $CMDLINE_OPTS && (if [ -x /sbin/mlnx_bf_configure ]; then /sbin/mlnx_bf_configure; fi)
This seems inconsistent with the documentation, which says "The /sbin/mlnx_bf_configure script runs automatically with mlx5_ib kernel module loaded", whereas here the script is hooked to ib_umad. I'm not sure if this is the cause of the OVS failure.
I also tried running /sbin/mlnx_bf_configure directly, and nothing happened.
mlnx-sf -a show
printed nothing but an empty line, and
ovs-vsctl show
printed the same result as before.
In addition to the above, I checked the en3f1pf1sf0 port with the command
ifconfig en3f1pf1sf0
and got the error:
en3f1pf1sf0: error fetching interface information: Device not found
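To check all of the configured ports at once rather than one by one, something like the following could be used. It is a minimal sketch, assuming mlnx-ovs.conf consists of plain shell-style KEY="value" assignments as shown above and that existing netdevs appear under /sys/class/net. The function name missing_ports is my own, and only the two bridges from my config are covered.

```shell
#!/bin/sh
# Hypothetical helper: print every port named in OVS_BRIDGE1_PORTS or
# OVS_BRIDGE2_PORTS that has no matching entry in the netdev directory.
# $1 = path to the conf file, $2 = netdev directory (default /sys/class/net).
missing_ports() {
    conf=$1
    netdir=${2:-/sys/class/net}
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 1; }
    # Source the file in a subshell so its variables do not leak out.
    (
        . "$conf"
        for ports in "$OVS_BRIDGE1_PORTS" "$OVS_BRIDGE2_PORTS"; do
            for p in $ports; do
                [ -e "$netdir/$p" ] || echo "$p"
            done
        done
    )
}

# Usage on the DPU:
#   missing_ports /etc/mellanox/mlnx-ovs.conf
```

Any name it prints is a port that the bridge setup would try to add but that does not exist as a network device.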
Thank you very much for reading this. That is everything I have tried so far, and I have run out of ideas. Can somebody give me a hand with this? I would be very grateful.
P.S. Our device works fine in Separated Host mode, so I don't think it's a connection or hardware failure, but I welcome any corrections if I have made a mistake somewhere.