Hi, I have two BlueField-2 DPUs connected to two different servers, and I am getting the same series of errors on both. I am trying to get the cards up and running, but I am struggling to even update the firmware.
I have installed DOCA and am following the BlueField DPU Administrator Quick Start Guide (NVIDIA Docs).
I have updated the BFB image, and I have checked the firmware version:
ubuntu@localhost:~$ sudo flint -d /dev/mst/mt41686_pciconf0 q
Image type: FS4
FW Version: 24.29.1016
FW Release Date: 31.12.2020
Product Version: 24.29.1016
Rom Info: type=UEFI Virtio net version=21.1.11 cpu=AMD64
type=UEFI Virtio blk version=22.1.11 cpu=AMD64
type=UEFI version=14.22.14 cpu=AMD64,AARCH64
type=PXE version=3.6.204 cpu=AMD64
Description: UID GuidsNumber
Base GUID: b8cef603008de76c 14
Base MAC: b8cef68de76c 14
Image VSD: N/A
Device VSD: N/A
PSID: MT_0000000539
Security Attributes: N/A
I then try updating the firmware, but I get an error:
ubuntu@localhost:~$ sudo /opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl --force-fw-update
Initializing...
Attempting to perform Firmware update...
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: BlueField2
Part Number: MBF2H332A-AENO_Ax
Description: BlueField-2 P-Series SmartNIC 25GbE Dual-Port SFP56; PCIe Gen3/4 x8; Crypto Disabled; 16GB on-board DDR; 1GbE OOB management; HHHL
PSID: MT_0000000539
PCI Device Name: 03:00.0
Base GUID: b8cef603008de76c
Base MAC: b8cef68de76c
Versions:        Current        Available
FW               24.29.1016     24.41.1000
NVMe             N/A            20.4.0001
PXE              3.6.0204       3.7.0400
UEFI             14.22.0014     14.34.0012
UEFI Virtio blk  22.1.0011      22.4.0013
UEFI Virtio net  21.1.0011      21.4.0013
Status: Update required
---------
Found 1 device(s) requiring firmware update...
Device #1: Updating FW ...
Fail : Bad parameter
Log File: /tmp/e453EQ822u
Real log file: /tmp/mlnx_fw_update.log
I also cannot reset the device with mlxfwreset.
From the DPU (Arm side):
ubuntu@localhost:~$ sudo mlxfwreset -d /dev/mst/mt41686_pciconf0 -l 4 -t 0 reset
The reset level for device, /dev/mst/mt41686_pciconf0 is:
4: Warm Reboot
Please be aware that resetting the Bluefield may take several minutes. Exiting the process in the middle of the waiting period will not halt the reset.
The ARM side will be restarted, and it will be unavailable for a while.
Continue with reset?[y/N] y
-I- Sending Reset Command To Fw -Failed
-E- Failed to send Register MFRL: Bad parameter (265).
From the host:
host# mlxfwreset -d /dev/mst/mt41686_pciconf0 reset
-E- Synchronization by driver is not supported in the current state of this device.
I also tried burning the firmware manually from the host, using an older firmware version (24.35.3502):
host# flint -d /dev/mst/mt41686_pciconf0 -I ./fw-BlueField-2-rel-24_35_3502-MBF2H332A-AENO_Ax_Bx-NVME-20.4.1-UEFI-21.4.10-UEFI-22.4.10-UEFI-14.29.15-FlexBoot-3.6.902.bin burn
Current FW version on flash: 24.29.1016
New FW version: 24.35.3502
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Restoring signature - OK
-I- To load new FW run mlxfwreset or reboot machine.
The burn seems to have worked, although bfvcheck still reports the old version as installed and says a power cycle is needed:
ubuntu@localhost:~$ sudo bfvcheck
Beginning version check...
-RECOMMENDED VERSIONS-
ATF: v2.2(release):4.7.0-25-g5569834
UEFI: 4.7.0-42-g13081ae
FW: 24.41.1000
-INSTALLED VERSIONS-
ATF: v2.2(release):4.7.0-25-g5569834
UEFI: 4.7.0-42-g13081ae
FW: 24.29.1016
WARNING: FW VERSION DOES NOT MATCH RECOMMENDED!
WARNING: The firmware has been updated to 24.35.3502, but the chassis
must be power cycled for changes to take effect.
Version check complete.
But again, I cannot reset the device from the host:
host# mlxfwreset --device /dev/mst/mt41686_pciconf0 --sync 1 -y reset
-E- Synchronization by driver is not supported in the current state of this device.
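For anyone comparing versions across several cards, the bfvcheck output above can be checked with a short script. This is only a minimal Python sketch, assuming the output format shown in this post; the helper names (parse_bfvcheck, fw_tuple, mismatches) are hypothetical and not part of any NVIDIA tool.

```python
import re

# Sample text copied from the bfvcheck output earlier in this post.
SAMPLE = """\
-RECOMMENDED VERSIONS-
ATF: v2.2(release):4.7.0-25-g5569834
UEFI: 4.7.0-42-g13081ae
FW: 24.41.1000
-INSTALLED VERSIONS-
ATF: v2.2(release):4.7.0-25-g5569834
UEFI: 4.7.0-42-g13081ae
FW: 24.29.1016
"""


def parse_bfvcheck(text):
    """Group 'NAME: value' lines under their '-XXX VERSIONS-' header.

    Assumption: the layout matches the output pasted in this post.
    Lines such as 'WARNING: ...' are skipped because their value
    contains spaces.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"-(\w+) VERSIONS-", line)
        if header:
            current = sections.setdefault(header.group(1), {})
            continue
        entry = re.fullmatch(r"(\w+): (\S+)", line)
        if entry and current is not None:
            current[entry.group(1)] = entry.group(2)
    return sections


def fw_tuple(version):
    """Turn a dotted firmware version such as '24.29.1016' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def mismatches(sections):
    """Return the component names whose installed version differs from the recommended one."""
    recommended = sections.get("RECOMMENDED", {})
    installed = sections.get("INSTALLED", {})
    return sorted(name for name in recommended if installed.get(name) != recommended[name])


if __name__ == "__main__":
    parsed = parse_bfvcheck(SAMPLE)
    for name in mismatches(parsed):
        print(name, parsed["INSTALLED"].get(name), "->", parsed["RECOMMENDED"][name])
```

Comparing the versions as integer tuples rather than as strings avoids ordering mistakes when a field has a different number of digits.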
Thank you for the help.