OVS cannot detect the BlueField interfaces, but the BlueField card works correctly

Hi,

I have two machines from Cloudlab, each with a BlueField SmartNIC installed. I changed one SmartNIC to ECPF mode; the other is in Separated Host mode. On one host I assigned two IP addresses to the two SmartNIC interfaces, and on the other host I assigned two different IP addresses. Each host can then ping the other successfully. However, when I log in to the ECPF-mode SmartNIC and run sudo ovs-vsctl show, it shows:

1aaab65a-46a8-47a5-89bd-66cb18f8dac4
    Bridge ovsbr2
        Port p1
            Interface p1
        Port en3f1pf1sf0
            Interface en3f1pf1sf0
                error: "could not open network device en3f1pf1sf0 (No such device)"
        Port ovsbr2
            Interface ovsbr2
                type: internal
        Port pf1hpf
            Interface pf1hpf
                error: "could not open network device pf1hpf (No such device)"
    Bridge ovsbr1
        Port pf0hpf
            Interface pf0hpf
                error: "could not open network device pf0hpf (No such device)"
        Port ovsbr1
            Interface ovsbr1
                type: internal
        Port en3f0pf0sf0
            Interface en3f0pf0sf0
                error: "could not open network device en3f0pf0sf0 (No such device)"
        Port p0
            Interface p0
    ovs_version: "2.15.1"

Does this mean that OVS cannot detect the SmartNIC interfaces? Why is that, and how can I solve this problem? Thanks for your help.
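One way to check that reading is to pull the device names out of the error lines and see whether the kernel on the ARM side knows about them at all. The sketch below is only an illustration: `ovs_missing_ifaces` and `iface_exists` are hypothetical helper names, and it assumes the error text has exactly the form shown in the output above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of OVS or the BlueField tools).

# Read `ovs-vsctl show` output on stdin and print each device name that
# appears in a "could not open network device NAME (...)" error line.
ovs_missing_ifaces() {
    sed -n 's/.*could not open network device \([^ ]*\) (.*/\1/p'
}

# Succeed if the kernel exposes a network device with the given name.
# The sysfs root is a second, optional argument only so the check can be
# tried against a fake directory tree.
iface_exists() {
    [ -e "${2:-/sys/class/net}/$1" ]
}

# Example (run on the BlueField ARM side):
#   sudo ovs-vsctl show | ovs_missing_ifaces | while read -r dev; do
#       if iface_exists "$dev"; then echo "$dev: present"; else echo "$dev: missing"; fi
#   done
```

If every name comes back as missing, OVS is only reporting that the ports configured on the bridges do not exist as network devices on the ARM; the bridge configuration itself may be fine.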

Update: Something strange: I just deleted the flow rule from ovsbr1, and ping from one x86 host to the other still succeeds. If all packets have to go through OVS before reaching the other host, how can this still work after I deleted all rules for ovsbr1?
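To see what is actually left on the bridge after deleting rules, the OpenFlow table can be dumped with `ovs-ofctl dump-flows ovsbr1` and the datapath flows with `ovs-appctl dpctl/dump-flows`. A minimal sketch for counting the remaining OpenFlow entries, assuming each flow in the `ovs-ofctl dump-flows` output is printed on a line containing `cookie=`; `count_flows` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `ovs-ofctl dump-flows BRIDGE` output on stdin
# and print how many flow entries it contains.
# Assumption: every flow entry line contains the text "cookie=".
count_flows() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c 'cookie=' || true
}

# Example (run on the BlueField ARM side):
#   sudo ovs-ofctl dump-flows ovsbr1 | count_flows
```

If ovsbr1 reports zero flows and the ping still succeeds, one possible reading (not confirmed in this thread) is that the traffic is not passing through OVS on the ARM at all, which would fit the missing representor ports in the output above.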

Hi Lyuxiaosu,

Thank you for posting your query on our forum.

I would like to check whether you have had an opportunity to review the following documentation, in order to confirm that the configuration was done as per the documents:
a. Modes of Operation - BlueField DPU OS 3.8.5 - NVIDIA Networking Docs

b. Virtual Switch on BlueField DPU - BlueField DPU OS 3.8.5 - NVIDIA Networking Docs

If so, and the issue is still seen, I would like to request that you open a support ticket, since this might require in-depth debugging and possibly a reproduction. In case you do not have an active contract with us, you may reach out to our contracts team at networking-contracts@nvidia.com . The support ticket can be opened by emailing Networking-support@nvidia.com

Thanks,
Namrata.

Hi Namrata,

Thanks for your reply. I read the documents and confirmed that the mode is Embedded mode. The problem is that the PF (pf0hpf/pf1hpf) and SF (enp3s0f0s0/en3f1pf1sf0) ports do not show up in the BlueField OS. Running ifconfig or lshw to list all interfaces shows no such interfaces, and that is what OVS complains about. I tried several ways to debug this: I reinstalled the OS on the BlueField 2 and reset the NIC with the command mlxconfig -d /dev/mst/mt41686_pciconf0 -y reset, but neither helped and the issue is still there.

I got two machines with BlueField 2 SmartNICs installed from Cloudlab. One machine works correctly and shows all interfaces; the other does not. Comparing the two machines, I found that the non-working one does not have the parameters ECPF_ESWITCH_MANAGER and ECPF_PAGE_SUPPLIER. When I switch it to Embedded mode and run the command mlxconfig -d /dev/mst/mt41686_pciconf0 s ECPF_ESWITCH_MANAGER=1 ECPF_PAGE_SUPPLIER=1, it complains: The Device doesn't support ECPF_ESWITCH_MANAGER parameter. I guess this is related to the problem.
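The comparison between the two machines can be made repeatable by saving the mlxconfig query output from each one and listing the parameters that only the working card reports. This is a rough sketch, not an official procedure: the helper names and the file names `working.txt` and `broken.txt` are made up for illustration, and it assumes the query prints each parameter as an upper-case name followed by its value.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two saved `mlxconfig ... q` dumps.
# Assumption: each parameter is printed as "NAME VALUE" with an upper-case NAME.

# Print the sorted parameter names found in the dump file given as $1.
mlx_param_names() {
    LC_ALL=C awk 'NF >= 2 && $1 ~ /^[A-Z][A-Z0-9_]*$/ { print $1 }' "$1" | LC_ALL=C sort -u
}

# Print the parameters that appear in dump $1 but not in dump $2.
mlx_params_only_in_first() {
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    mlx_param_names "$1" > "$a"
    mlx_param_names "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Example:
#   (working machine)     sudo mlxconfig -d /dev/mst/mt41686_pciconf0 q > working.txt
#   (non-working machine) sudo mlxconfig -d /dev/mst/mt41686_pciconf0 q > broken.txt
#   mlx_params_only_in_first working.txt broken.txt
```

Attaching the resulting list to a support request shows exactly which parameters differ, rather than only the two that were noticed by hand.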

Should I email Networking-support@nvidia.com to describe the problem?

Thanks,
Xiaosu.

Hi Xiaosu,

Please share the following outputs from both the working and the non-working machine:

a. # ibv_devinfo -v | egrep 'board|fw'
b. From the ARM, #cat /etc/mlnx-release

If this requires extensive debugging and you submit a support ticket without an active support contract in place, unfortunately, debugging would be restricted.

Thanks,
Namrata.

Hi Namrata,

Here is the result:
Running ibv_devinfo -v | egrep 'board|fw' on the host machine gives the same result for both the working and the non-working one:

        fw_ver:                         16.28.4512
        board_id:                       DEL0000000016
        fw_ver:                         16.28.4512
        board_id:                       DEL0000000016
        fw_ver:                         24.32.2004
        board_id:                       MT_0000000477
        fw_ver:                         24.32.2004
        board_id:                       MT_0000000477

Running ibv_devinfo -v | egrep 'board|fw' on the ARM of the working one gives:

ubuntu@localhost:~$ ibv_devinfo -v |egrep 'board|fw'
        fw_ver:                         24.32.2004
        board_id:                       MT_0000000477
        fw_ver:                         24.32.2004
        board_id:                       MT_0000000477
        fw_ver:                         24.32.2004
        board_id:                       MT_0000000477
        fw_ver:                         24.32.2004
        board_id:                       MT_0000000477

Running ibv_devinfo -v | egrep 'board|fw' on the ARM of the non-working one gives:

ubuntu@localhost:~$ ibv_devinfo -v |egrep 'board|fw'
        fw_ver:                         24.32.2004
        board_id:                       MT_0000000477
        fw_ver:                         24.32.2004
        board_id:                       MT_0000000477
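Going by the two ARM outputs above, the working card reports four fw_ver/board_id pairs and the non-working card reports only two. A small sketch for making the same comparison without counting by eye, assuming ibv_devinfo prints one `fw_ver:` line per device; `count_ibv_devices` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `ibv_devinfo -v` output on stdin and print
# the number of devices it lists.
# Assumption: each device contributes exactly one "fw_ver:" line.
count_ibv_devices() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c 'fw_ver:' || true
}

# Example (run on the ARM of each card):
#   ibv_devinfo -v | count_ibv_devices
```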

Running cat /etc/mlnx-release on the ARM gives the same result on both the working and the non-working machine:

DOCA_1.3.0_BSP_3.9.0_Ubuntu_20.04-6.signed

Thanks for your reply.
Xiaosu.

Hi Xiaosu,

The PSID of the card on both hosts is identical, which means the cards are identical, yet one machine supports setting ECPF_ESWITCH_MANAGER=1 and the other does not?

I wouldn’t expect identical cards to behave differently. I would recommend opening a support ticket; however, please note that if we determine that you do not have an active contract, support would be restricted.

Thanks,
Namrata.