BlueField-3 fails to bring up InfiniBand network

Hi everyone,
Our BlueField-3 and its host both have DOCA 2.2.0 installed.
I am configuring my BlueField-3 InfiniBand cards and setting them to “NIC mode” on an InfiniBand network with these commands:

sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e s INTERNAL_CPU_OFFLOAD_ENGINE=1
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e s LINK_TYPE_P1=1 LINK_TYPE_P2=1 

After a cold reboot of the machine, the host failed to bring up the InfiniBand network as expected. The output is as follows:

xxx@bf3:/dev/mst$ ibstatus
Infiniband device 'mlx5_0' port 1 status:
	default gid:	 fe80:0000:0000:0000:a088:c203:0030:f0cc
	base lid:	 0xffff
	sm lid:		 0x0
	state:		 1: DOWN
	phys state:	 5: LinkUp
	rate:		 100 Gb/sec (4X EDR)
	link_layer:	 InfiniBand

Infiniband device 'mlx5_1' port 1 status:
	default gid:	 fe80:0000:0000:0000:a088:c203:0030:f0cd
	base lid:	 0xffff
	sm lid:		 0x0
	state:		 1: DOWN
	phys state:	 5: LinkUp
	rate:		 100 Gb/sec (4X EDR)
	link_layer:	 InfiniBand
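The pattern in this output (physical link up, logical state not ACTIVE, no LID assigned) can be checked mechanically. Below is a minimal sketch that parses ibstatus-style text and lists such ports. The function names parse_ibstatus and link_up_but_not_active are hypothetical helpers for illustration, and the sample is a shortened copy of the output above with simplified whitespace.

```python
import re

# Shortened copy of the ibstatus output above (whitespace simplified).
SAMPLE = """\
Infiniband device 'mlx5_0' port 1 status:
    base lid:    0xffff
    sm lid:      0x0
    state:       1: DOWN
    phys state:  5: LinkUp
    link_layer:  InfiniBand

Infiniband device 'mlx5_1' port 1 status:
    base lid:    0xffff
    sm lid:      0x0
    state:       1: DOWN
    phys state:  5: LinkUp
    link_layer:  InfiniBand
"""

HEADER = re.compile(r"Infiniband device '([^']+)' port (\d+) status:")


def parse_ibstatus(text):
    """Return one dict per 'Infiniband device' block (hypothetical helper)."""
    ports = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = HEADER.match(line)
        if m:
            current = {"device": m.group(1), "port": int(m.group(2))}
            ports.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so values such as "1: DOWN" stay intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return ports


def link_up_but_not_active(ports):
    """Devices whose physical link is up while the logical state is not ACTIVE."""
    return [
        p["device"]
        for p in ports
        if p.get("phys state", "").endswith("LinkUp")
        and not p.get("state", "").endswith("ACTIVE")
    ]


if __name__ == "__main__":
    print(link_up_but_not_active(parse_ibstatus(SAMPLE)))
```

In practice the text would come from running ibstatus on the host rather than from a hard-coded sample.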

Then I checked the status of opensmd:

xxx@bf3:/dev/mst$ sudo systemctl status opensmd
● opensmd.service - OpenSM
     Loaded: loaded (/lib/systemd/system/opensmd.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-09-13 11:46:57 UTC; 17min ago
   Main PID: 6732 (opensm)
      Tasks: 400 (limit: 153780)
     Memory: 20.6M
        CPU: 311ms
     CGroup: /system.slice/opensmd.service
             ├─6732 /usr/sbin/opensm
             └─6734 osm_crashd "" "" "" "" "" ""

Sep 13 11:46:57 ds01 OpenSM[6732]: /var/log/opensm.log log file opened
Sep 13 11:46:57 ds01 OpenSM[6732]: OpenSM 5.16.0.MLNX20230719.c143fc96
Sep 13 11:46:57 ds01 opensm[6732]: OpenSM 5.16.0.MLNX20230719.c143fc96
Sep 13 11:46:57 ds01 opensm[6732]: Using default GUID 0xa088c2030030f0cc
Sep 13 11:46:57 ds01 OpenSM[6732]: Entering DISCOVERING state
Sep 13 11:46:57 ds01 opensm[6732]: Entering DISCOVERING state
Sep 13 11:46:58 ds01 OpenSM[6732]: SM port is down
Sep 13 11:46:58 ds01 opensm[6732]: SM port is down
Sep 13 11:46:58 ds01 opensm[6732]: Check SM is configured to use a physical port
Sep 13 11:46:58 ds01 OpenSM[6732]: Check SM is configured to use a physical port
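To see which device the SM is bound to, the GUID in the log can be matched against the default GID from ibstatus, since the lower 64 bits of the default GID are the port GUID. The sketch below does that and also collapses the duplicated journal lines (the same message is logged under both the OpenSM and opensm tags). The helper names are hypothetical, and the constants are copied from the outputs in this post.

```python
import re

# Excerpt of the journal lines shown above.
JOURNAL = """\
Sep 13 11:46:57 ds01 opensm[6732]: Using default GUID 0xa088c2030030f0cc
Sep 13 11:46:57 ds01 OpenSM[6732]: Entering DISCOVERING state
Sep 13 11:46:57 ds01 opensm[6732]: Entering DISCOVERING state
Sep 13 11:46:58 ds01 OpenSM[6732]: SM port is down
Sep 13 11:46:58 ds01 opensm[6732]: SM port is down
"""

# Default GIDs reported by ibstatus for the two devices.
DEFAULT_GIDS = {
    "mlx5_0": "fe80:0000:0000:0000:a088:c203:0030:f0cc",
    "mlx5_1": "fe80:0000:0000:0000:a088:c203:0030:f0cd",
}

LINE = re.compile(r"^(\w{3} +\d+ [\d:]+) (\S+) (\w+)\[(\d+)\]: (.*)$")


def unique_messages(text):
    """Drop lines that repeat the same timestamp, PID and message."""
    seen = set()
    out = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        stamp, _host, _tag, pid, msg = m.groups()
        key = (stamp, pid, msg)
        if key not in seen:
            seen.add(key)
            out.append((stamp, msg))
    return out


def sm_guid(text):
    """Extract the GUID from the 'Using default GUID' line, or None."""
    m = re.search(r"Using default GUID (0x[0-9a-fA-F]+)", text)
    return m.group(1).lower() if m else None


def gid_to_port_guid(gid):
    """The lower 64 bits (last four groups) of a default GID are the port GUID."""
    return "0x" + "".join(gid.split(":")[4:]).lower()


def device_for_guid(guid, gids):
    """Return the device whose default GID carries this port GUID, or None."""
    for device, gid in gids.items():
        if gid_to_port_guid(gid) == guid:
            return device
    return None


if __name__ == "__main__":
    for stamp, msg in unique_messages(JOURNAL):
        print(stamp, msg)
    print("SM is bound to:", device_for_guid(sm_guid(JOURNAL), DEFAULT_GIDS))
```

With the values from this post, the SM GUID maps to mlx5_0, which is one of the ports that ibstatus reports as DOWN.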

I suspect that the SM port is related to my error, but restarting the opensmd service does not help, and I have made sure that both the NIC port and the cable are healthy (i.e. the physical link is up). Does anyone have any idea?

  1. The SM log shows that you are running it on a port that is down (GUID 0xa088c2030030f0cc). The SM cannot work on a down port.

  2. ibstatus shows that the BF3 (CX7, NDR) is linked to an EDR switch.

So… may I ask how to solve these errors? Should I replace the IB switch?

Actually, I have another BF3 that is connected to an EDR switch and works as expected, so I am really not sure what to do…

:~$ ibstatus
Infiniband device 'mlx5_0' port 1 status:
	default gid:	 fe80:0000:0000:0000:a088:c203:0032:3046
	base lid:	 0x1
	sm lid:		 0x1
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 100 Gb/sec (4X EDR)
	link_layer:	 InfiniBand

Infiniband device 'mlx5_1' port 1 status:
	default gid:	 fe80:0000:0000:0000:a088:c203:0032:3047
	base lid:	 0x2
	sm lid:		 0x1
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 100 Gb/sec (4X EDR)
	link_layer:	 InfiniBand

Guys! I fixed this problem. The BlueField-3 DPU can actually work with NDR, as noted in https://docs.nvidia.com/networking/display/BlueField3DPU/Specifications, so this problem is not related to the switch version.

The real problem is that one BlueField-3 DPU host did not start up as expected. Instead, it got stuck during initialization.

Here is the solution:
The initialization eventually times out. After that, I re-flashed the DPU with a new version of the BFB and then restarted this machine (let’s call it M).

After the reboot, I found that all the other BlueField-3 DPU hosts that share an InfiniBand switch with M entered PORT_ACTIVE successfully!

I suspect that one port stuck in the PORT_INIT state stops the SM from proceeding, so that none of the other ports in the subnet can become active.

I don’t know whether this is an InfiniBand feature or just a bug, as hardware and link-layer implementation are not my area.
