Issues upgrading the firmware of a BlueField-2 DPU

Hi, I have two BlueField-2 DPUs connected to two different servers, and I am getting the same series of errors on both. I am trying to get the cards up and running, but I am struggling to even update the firmware.

I have installed DOCA and I am following the guide here: BlueField DPU Administrator Quick Start Guide - NVIDIA Docs

I have updated the BFB image, and I have checked the firmware version:

ubuntu@localhost:~$ sudo flint -d /dev/mst/mt41686_pciconf0 q
Image type:            FS4
FW Version:            24.29.1016
FW Release Date:       31.12.2020
Product Version:       24.29.1016
Rom Info:              type=UEFI Virtio net version=21.1.11 cpu=AMD64
                       type=UEFI Virtio blk version=22.1.11 cpu=AMD64
                       type=UEFI version=14.22.14 cpu=AMD64,AARCH64
                       type=PXE version=3.6.204 cpu=AMD64
Description:           UID                GuidsNumber
Base GUID:             b8cef603008de76c        14
Base MAC:              b8cef68de76c            14
Image VSD:             N/A
Device VSD:            N/A
PSID:                  MT_0000000539
Security Attributes:   N/A
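
For anyone scripting this check, individual fields can be pulled out of the query output with a small helper. This is only a minimal sketch: flint_field is a hypothetical name (it is not part of the MFT tools), and it assumes the "Key: value" layout shown above.

```shell
# flint_field KEY
# Reads `flint ... q` output on stdin and prints the value of the first
# line that starts with "KEY:". Hypothetical helper, not part of MFT.
flint_field() {
    awk -v key="$1" '
        index($0, key ":") == 1 {
            sub(/^[^:]*:[ \t]*/, "")   # drop the key and the padding after it
            print
            exit
        }'
}

# Example (requires the device and the MFT tools to be present):
#   sudo flint -d /dev/mst/mt41686_pciconf0 q | flint_field "FW Version"
```

The helper only reads text from stdin, so it can also be run against a saved copy of the query output.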

I then try updating the firmware, but I get an error:

ubuntu@localhost:~$ sudo /opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl --force-fw-update
Initializing...
Attempting to perform Firmware update...
Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      BlueField2
  Part Number:      MBF2H332A-AENO_Ax
  Description:      BlueField-2 P-Series SmartNIC 25GbE Dual-Port SFP56; PCIe Gen3/4 x8; Crypto Disabled; 16GB on-board DDR; 1GbE OOB management; HHHL
  PSID:             MT_0000000539
  PCI Device Name:  03:00.0
  Base GUID:        b8cef603008de76c
  Base MAC:         b8cef68de76c
  Versions:         Current        Available
     FW             24.29.1016     24.41.1000
     NVMe           N/A            20.4.0001
     PXE            3.6.0204       3.7.0400
     UEFI           14.22.0014     14.34.0012
     UEFI Virtio blk   22.1.0011      22.4.0013
     UEFI Virtio net   21.1.0011      21.4.0013

  Status:           Update required

---------
Found 1 device(s) requiring firmware update...

Device #1: Updating FW ...
Fail : Bad parameter
Log File: /tmp/e453EQ822u
Real log file: /tmp/mlnx_fw_update.log
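
The Current/Available columns above are dotted version strings, so checking whether an update is pending comes down to a field-by-field numeric comparison. A minimal sketch, where version_lt is a hypothetical helper name and the versions are assumed to contain only digits and dots:

```shell
# version_lt A B
# Succeeds (exit 0) when dotted version A is strictly older than B.
# Fields are compared numerically, so "0204" is treated as 204.
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = x[i] + 0          # missing fields count as 0
            yi = y[i] + 0
            if (xi < yi) exit 0
            if (xi > yi) exit 1
        }
        exit 1                     # equal versions are not "older"
    }'
}

# Example, using the FW row from the table above:
#   version_lt 24.29.1016 24.41.1000 && echo "update required"
```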

I also cannot reset the device with mlxfwreset:

From device:

ubuntu@localhost:~$ sudo mlxfwreset -d /dev/mst/mt41686_pciconf0 -l 4 -t 0 reset

The reset level for device, /dev/mst/mt41686_pciconf0 is:

4: Warm Reboot
Please be aware that resetting the Bluefield may take several minutes. Exiting the process in the middle of the waiting period will not halt the reset.
The ARM side will be restarted, and it will be unavailable for a while.
Continue with reset?[y/N] y
-I- Sending Reset Command To Fw             -Failed
-E- Failed to send Register MFRL: Bad parameter (265).

From host:

host# mlxfwreset -d /dev/mst/mt41686_pciconf0  reset

-E- Synchronization by driver is not supported in the current state of this device.

I also tried burning the firmware manually with flint, using an older release than the one the updater offers:

host# flint -d /dev/mst/mt41686_pciconf0 -I ./fw-BlueField-2-rel-24_35_3502-MBF2H332A-AENO_Ax_Bx-NVME-20.4.1-UEFI-21.4.10-UEFI-22.4.10-UEFI-14.29.15-FlexBoot-3.6.902.bin burn

    Current FW version on flash:  24.29.1016
    New FW version:               24.35.3502

FSMST_INITIALIZE -   OK
Writing Boot image component -   OK
Restoring signature                     - OK
-I- To load new FW run mlxfwreset or reboot machine.

This seems to have worked:

ubuntu@localhost:~$ sudo bfvcheck
Beginning version check...

-RECOMMENDED VERSIONS-
ATF: v2.2(release):4.7.0-25-g5569834
UEFI: 4.7.0-42-g13081ae
FW: 24.41.1000

-INSTALLED VERSIONS-
ATF: v2.2(release):4.7.0-25-g5569834
UEFI: 4.7.0-42-g13081ae
FW: 24.29.1016

WARNING: FW VERSION DOES NOT MATCH RECOMMENDED!

WARNING: The firmware has been updated to 24.35.3502, but the chassis
must be power cycled for changes to take effect.

Version check complete.
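
Note that the installed FW line still shows the old version while the second warning names the newly burned one. A rough sketch for picking both values out of this text, assuming the exact wording printed above (bfv_installed_fw and bfv_pending_fw are made-up helper names, not part of bfvcheck):

```shell
# Print the FW version listed under "-INSTALLED VERSIONS-" in bfvcheck
# output read from stdin.
bfv_installed_fw() {
    awk '/^-INSTALLED VERSIONS-/ { seen = 1 }
         seen && $1 == "FW:"     { print $2; exit }'
}

# Print the version named in the "has been updated to X" warning, if any.
# Prints nothing when the warning is absent.
bfv_pending_fw() {
    sed -n 's/^WARNING: The firmware has been updated to \([0-9.]*\), but.*/\1/p'
}

# Example (on the DPU):
#   sudo bfvcheck | bfv_pending_fw
```

A non-empty result from bfv_pending_fw means a burned image is still waiting for a power cycle.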

But again I cannot reset the device:

host# mlxfwreset --device /dev/mst/mt41686_pciconf0 --sync 1 -y reset

-E- Synchronization by driver is not supported in the current state of this device.

Thank you for the help

“WARNING: The firmware has been updated to 24.35.3502, but the chassis
must be power cycled for changes to take effect.”

Please try to power cycle.

Regards,
Yaniv

You’ll be shocked to learn that this actually worked…
After the manual firmware burn with flint + reboot, I was able to run

ubuntu@localhost:~$ sudo /opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl --force-fw-update

and I was also able to reset the device with mlxfwreset.

Thanks!

Keep in mind that it is not unusual to need a cold boot/power cycle of the host to clear any current state, or before/after a FW upgrade (as applicable), even when mlxfwreset is present and available for use.
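
To put the thread's outcome in script form: both mlxfwreset errors quoted earlier stopped appearing after the manual burn and power cycle. The sketch below only matches those two error strings as they appear in this thread; needs_power_cycle is a hypothetical helper, and other errors may well have different causes.

```shell
# Succeeds if the text on stdin contains one of the two mlxfwreset errors
# seen in this thread. Hypothetical helper, matching on message text only.
needs_power_cycle() {
    grep -q -e 'Synchronization by driver is not supported' \
            -e 'Failed to send Register MFRL: Bad parameter'
}

# Example (the reset itself needs the real device):
#   if sudo mlxfwreset -d /dev/mst/mt41686_pciconf0 -y reset 2>&1 | needs_power_cycle; then
#       echo "Power cycle the host, then retry."
#   fi
```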