BlueField 3 is not working

device: NVIDIA BlueField-3 B3210 P-Series FHHL DPU, 100GbE (default mode)

The BlueField has been inaccessible since I rebooted the DPU (the DPU only).

The following messages appear during boot:

[275604.216789] mlx5_core 0000:55:00.1: 63.008 Gb/s available PCIe bandwidth, limited by 8 GT/s x8 link at 0000:ae:00.0 (capable of 126.024 Gb/s with 16 GT/s x8 link)
[275624.187596] mlx5_core 0000:55:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 100s
[275644.152994] mlx5_core 0000:55:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 79s
[275664.118404] mlx5_core 0000:55:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 59s
[275684.083806] mlx5_core 0000:55:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 39s
[275704.049211] mlx5_core 0000:55:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 19s
[275723.954752] mlx5_core 0000:55:00.1: mlx5_function_setup:1237:(pid 943): Firmware over 120000 MS in pre-initializing state, aborting
[275723.968261] mlx5_core 0000:55:00.1: init_one:1813:(pid 943): mlx5_load_one failed with error code -16
[275723.978578] mlx5_core: probe of 0000:55:00.1 failed with error -16
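
Two details in the log above can be checked mechanically. The sketch below is illustrative only and is not part of any NVIDIA tool: the helper names `link_bandwidth_gbps` and `decode_probe_error` are made up for this example. It assumes the per-lane rate is the raw transfer rate reduced by the 128b/130b encoding overhead with integer division, which reproduces the 63.008 and 126.024 Gb/s figures in the first line, and it maps the probe error number to its symbolic errno name (16 is EBUSY on Linux).

```python
import errno
import re

# Usable per-lane rate in Mb/s after line-encoding overhead.
# Assumption: integer division, chosen because it reproduces the log figures.
LANE_MBPS = {
    2.5: 2500 * 8 // 10,       # 8b/10b encoding
    5.0: 5000 * 8 // 10,       # 8b/10b encoding
    8.0: 8000 * 128 // 130,    # 128b/130b encoding
    16.0: 16000 * 128 // 130,  # 128b/130b encoding
}


def link_bandwidth_gbps(gts: float, lanes: int) -> float:
    """Return the usable bandwidth in Gb/s for a link speed and lane count."""
    return LANE_MBPS[float(gts)] * lanes / 1000


def decode_probe_error(line: str):
    """Return the errno name for a 'failed with error [code] -N' log line."""
    match = re.search(r"failed with error (?:code )?(-?\d+)", line)
    if match is None:
        return None
    code = abs(int(match.group(1)))
    return errno.errorcode.get(code, "UNKNOWN")


if __name__ == "__main__":
    print(link_bandwidth_gbps(8, 8))    # current link: 8 GT/s x8
    print(link_bandwidth_gbps(16, 8))   # capable link: 16 GT/s x8
    print(decode_probe_error(
        "mlx5_core: probe of 0000:55:00.1 failed with error -16"))
```

In other words, the driver gave up with "device busy" because the firmware never left its pre-initializing state, and the link had trained at 8 GT/s rather than the 16 GT/s the log reports as achievable.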

I have tried all the commands in the troubleshooting guide, but none of them worked:

  • sudo mlxconfig -d /dev/mst/ -y reset
  • sudo mlxconfig -d s LINK_TYPE_P1=2 LINK_TYPE_P2=2

When I checked the hardware with lshw, both network functions showed up as UNCLAIMED:

*-network:0 UNCLAIMED
       description: Ethernet controller
       product: MT43244 BlueField-3 integrated ConnectX-7 network controller
       vendor: Mellanox Technologies
       physical id: 0
       bus info: pci@0000:55:00.0
       version: 01
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm cap_list
       configuration: latency=0
 *-network:1 UNCLAIMED
       description: Ethernet controller
       product: MT43244 BlueField-3 integrated ConnectX-7 network controller
       vendor: Mellanox Technologies
       physical id: 0.1
       bus info: pci@0000:55:00.1
       version: 01
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd 
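
UNCLAIMED in lshw means no kernel driver is bound to the PCI function. The same thing can be confirmed from sysfs, where a bound function has a `driver` symlink in its device directory. The following is a minimal sketch, assuming the two bus addresses from the output above; the helper name `bound_driver` and its `sysfs_root` parameter are hypothetical and exist only for this example.

```python
import os


def bound_driver(bdf: str, sysfs_root: str = "/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI function, or None.

    bdf is the full PCI address, for example "0000:55:00.0".
    """
    link = os.path.join(sysfs_root, bdf, "driver")
    if not os.path.islink(link):
        # No symlink: nothing has claimed this function.
        return None
    # The symlink target ends with the driver name, e.g. .../drivers/mlx5_core
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # Addresses taken from the lshw output in this post.
    for bdf in ("0000:55:00.0", "0000:55:00.1"):
        print(bdf, bound_driver(bdf) or "no driver bound")
```

On a healthy system both functions would be expected to report `mlx5_core`; "no driver bound" matches the UNCLAIMED state shown by lshw.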

After reading a BlueField-2 post about the unclaimed state, I ran mlxfwreset, but it did not complete successfully.
(Ref. BF2 DPU shows "unclaimed")

host> sudo mlxfwreset -d /dev/mst/mt41692_pciconf0 -l 3 reset
Requested reset level for device, /dev/mst/mt41692_pciconf0:

3: Driver restart and PCI reset
Please be aware that resetting the Bluefield may take several minutes. Exiting the process in the middle of the waiting period will not halt the reset
Continue with reset?[y/N] y
-I- Sending Reset Command To Fw             -Done
Arm OS shut down in progress, the completion of the process may take several minutes.
-E- The PCI link is still up even after the expected time (360.0) seconds has passed. Exiting the process..

Additional information:

  • Secure boot is disabled.
  • Ubuntu 22.04
  • DOCA Version is 2.5.0 (host, dpu)

How can I resolve the issue?

Hello juwon,

Thank you for posting your inquiry on the NVIDIA Developer Forum - Infrastructure and Networking - Section.

Please re-seat the adapter in its PCIe slot. If the issue still occurs, you can open an RMA if the adapter is still under warranty or has a valid support entitlement.

Thank you and regards,
~NVIDIA Networking Technical Support
