Hi, I have a serious problem with my BlueField-2 DPU. I managed to obtain two of these cards, and one of them is working perfectly, but after swapping it with the other, my system complains and I cannot access it at all.
First of all, it times out during boot with the following error: mlx5_function_setup:1237:(pid 943): Firmware over 120000 MS in pre-initializing state, aborting
I found in a troubleshooting document from NVIDIA that in this case I should reset the card (which I did, without success), and if nothing works, reflash the image. So I tried bfb-install, but it just hung, and nothing happened after the exit Boot Service sequence.
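For reference, here is a minimal sketch of the reflash invocation, wrapped so the command is printed for review rather than run. The image name `image.bfb` and the helper name `build_bfb_install_cmd` are placeholders that do not come from this thread. `--bfb` and `--rshim` are the bfb-install options as I understand them, so please confirm with `bfb-install --help` on your host.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assemble the bfb-install command line without running it.
# $1 = path to the BFB image (placeholder name, not from the thread)
# $2 = rshim device name (defaults to rshim0)
build_bfb_install_cmd() {
    local bfb="$1"
    local rshim="${2:-rshim0}"
    printf 'bfb-install --bfb %s --rshim %s\n' "$bfb" "$rshim"
}

# Print the command for review; run it manually as root once it looks right.
build_bfb_install_cmd "image.bfb" "rshim0"
```

Printing first makes it easy to check which rshim device is targeted when two cards are installed.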
In the beginning, I thought that maybe the system had some problems after properly installing everything and testing the first card, then swapping it with the second card.
I have a server mainboard with multiple PCIe x16 slots (both wired to the same CPU), so I tried placing the second BlueField in the other slot… the same issue.
I also tried reinstalling the operating system from scratch and making the second BlueField the first one the operating system sees. This did not work either; in fact, the screenshots above are from the new OS.
The topic might be related to this, but I do not have another PC/server to try.
Yes, secure boot is disabled, and all security-kinda features are disabled too.
I did an apt upgrade on the working SmartNIC, and it did not come back after the reboot. I cannot reflash the image on that one either. Do I now have two expensive paperweights?
Or do you mean secure boot on the NIC itself? I have a BF2M516A-CECOT, which has crypto and secure boot enabled.
But I thought secure boot would just prevent me from installing third-party images on the NIC. But any image signed by NVIDIA should work. And I am about to flash the standard image from NVIDIA on the NICs.
If I issue this on the host, it says secure boot is disabled on the NIC too
I don’t really know as I could not access the SmartNIC since opening the box. But I was trying to flash all 3 versions available: 1.5, 2.0, 2.2.
In fact, some magic has happened just now. I tried flashing v1.5; it still hung at exit boot service, and I left it there for 2 hours again. Nothing happened after two hours.
As a last resort, I rebooted only the NIC via echo "SW_RESET 1" > /dev/rshim0/misc and attached to the console via cat /dev/rshim0/console 115200 to see what happened. I did not see any boot output, but an Ubuntu login prompt quickly appeared.
So, I restarted rshim on the host, reassigned the IP to tmfifo_net0, and voilà… I can access it now, and my latest attempt, version 1.5, is on the SmartNIC. This means that the install process actually finishes at some point, but I didn’t see the logs after the exit boot service… or at least, this is what I think now.
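In case it helps others, here is a rough sketch of the reset-and-reattach sequence described above. The function name `rshim_sw_reset` is my own. The `rshim` service name and the 192.168.100.1/24 address for tmfifo_net0 are the defaults as far as I know, so treat them as assumptions and adjust them to your setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: soft-reset the DPU through its rshim device directory.
# $1 = rshim directory (defaults to /dev/rshim0); writing to it needs root.
rshim_sw_reset() {
    local dev="${1:-/dev/rshim0}"
    if [ ! -e "$dev/misc" ]; then
        echo "no misc file under $dev; is the rshim service running?" >&2
        return 1
    fi
    # Same command as in the post above, with the device path parameterised.
    echo "SW_RESET 1" > "$dev/misc"
}

# Typical sequence on the host (commented out so nothing runs by accident):
#   rshim_sw_reset /dev/rshim0
#   cat /dev/rshim0/console                  # watch boot messages, Ctrl-C to stop
#   sudo systemctl restart rshim             # assumed service name
#   sudo ip addr add 192.168.100.1/24 dev tmfifo_net0   # assumed default subnet
```

The existence check is only there to give a readable error when the rshim device has not shown up yet.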
However, I don’t see any ports, no mlnx drivers are installed :(
I tried updating the system via apt, then rebooted the NIC again. It now ends in an error; the console says:
mlxbf2_gpio MLNXBF22:01: IRQ index 0 not found
mlxbf2_gpio MLNXBF22:02: IRQ index 0 not found
Is there any way to also interact with the boot process through the console? At boot time, it says that if I press the indicated key twice, I can get into a UEFI menu. Maybe I can do something there, because I think Secure Boot is still enabled, and I should disable it on the DPU somehow.
My PCIe link is working, as a graphics card is also used and was swapped several times to test the NIC.
I have realized something new, though.
If I reinstall DOCA 1.5 via bfb-install, then even though the process hangs at exit boot service, if I am attached to the console, I can see the message
INFO: Ubuntu installation started
but then there are several errors:
write counter to semaphore: Operation not permitted
write counter to semaphore: Operation not permitted
write counter to semaphore: Operation not permitted
write counter to semaphore: Operation not permitted
Yet the Installing OS image action takes place and finishes.
Then, next message in the console is:
cannot find required sysfs path /sys/bus/platform/devices/MLNXBF04:00/post_reset_wdog
please load mlxvf_bootctl kernel driver
mount /dev/mmcblk0p1 at /mnt/efi_system_partition
umount /dev/mmcblk0p1
Then the NIC reboots and I can access it again via SSH. But I don’t see any physical ports via ifconfig.
If I run dmesg, I see many errors like
Lockdown: modprobe: unsigned module loading is restricted; see man kernel_lockdown.7
Lockdown: mlxfwmanager: direct PCI access is restricted; see man kernel_lockdown.7
Lockdown: mlxconfig: direct PCI access is restricted; see man kernel_lockdown.7
Lockdown: mdevices_info: direct PCI access is restricted; see man kernel_lockdown.7
...
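These Lockdown lines come from the kernel lockdown feature, which is usually switched on automatically when the OS boots with UEFI Secure Boot enabled. Below is a small sketch for seeing which programs are being blocked and how often. The function name `summarize_lockdown` is made up, and it assumes each log line has the `Lockdown: <program>: ...` shape shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kernel log text on stdin and count the
# "Lockdown:" denials per program name (the word right after "Lockdown:").
summarize_lockdown() {
    grep -o 'Lockdown: [^:]*:' \
        | sed -e 's/^Lockdown: //' -e 's/:$//' \
        | sort | uniq -c | sort -rn
}

# On the DPU (dmesg may need root):
#   dmesg | summarize_lockdown
#   cat /sys/kernel/security/lockdown   # the active mode is shown in [brackets]
#   mokutil --sb-state                  # only if mokutil is installed
```

If the lockdown file shows anything other than `[none]`, unsigned modules and direct PCI access from tools such as mlxconfig will be refused, which would match the missing ports.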
I used /dev/rshim0/console to access UEFI in the same case as you.
During the boot process, I just quickly pressed the key a few times and somehow got the UEFI output.
The solution is to go to the UEFI BIOS configuration (default password is bluefield) and disable secure boot mode and set the BlueField mode to a valid configuration.
See Step 04.04: Ensure that the BlueField Mode is correctly set in the UEFI configuration in the following post
Hi, thanks for the info. I knew about the password, but unfortunately it was not working. I managed to overcome the issue, though, with some additional steps; I will explain them later once everything is working.
However, now I cannot access the BIOS/UEFI again. The password I set does not work, and reflashing new firmware, hoping it would reset the password to bluefield, does not help either. Why does this password never work? :)