BlueField 2 bricked

Hello, I have a BlueField 2 DPU (PSID: MT_0000000704).
As a prerequisite to SNAP deployment, I had to configure the firmware to use the following:

  1. virtio-blk emulation PF: mlxconfig -d /dev/mst/mt41686_pciconf0 s VIRTIO_BLK_EMULATION_ENABLE=1 VIRTIO_BLK_EMULATION_NUM_PF=1

  2. NVMe emulation PF: mlxconfig -d /dev/mst/mt41686_pciconf0 s NVME_EMULATION_ENABLE=1 NVME_EMULATION_NUM_PF=1

To apply these firmware configurations, I rebooted the BlueField using mlxfwreset -d 51:00.0 -y -l 3 --sync 1 r, which reported success.
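
For reference, the two configuration steps above and their reversal can be sketched as a dry-run shell helper. This is only a sketch under assumptions: the function name emulation_cmds and the DEV variable are made up for illustration, the device path is the one from this post, and clearing only the two *_EMULATION_ENABLE flags is assumed to be enough to turn the emulated functions off again. It prints the commands instead of running them, since a firmware reset is still needed before any change takes effect.

```shell
#!/bin/sh
# Dry-run sketch: prints mlxconfig commands, does not execute them.
# DEV defaults to the device path used in this post; override as needed.
DEV="${DEV:-/dev/mst/mt41686_pciconf0}"

# emulation_cmds is a hypothetical helper name.
# $1 = 1 to enable the emulation flags, 0 to disable them.
emulation_cmds() {
    val="$1"
    # Show the current emulation-related settings first.
    echo "mlxconfig -d $DEV q | grep -E 'VIRTIO_BLK_EMULATION|NVME_EMULATION'"
    # Toggle only the ENABLE flags; the NUM_PF values are left untouched.
    echo "mlxconfig -d $DEV s VIRTIO_BLK_EMULATION_ENABLE=$val"
    echo "mlxconfig -d $DEV s NVME_EMULATION_ENABLE=$val"
}

# Print the commands that would switch both emulations back off.
emulation_cmds 0
```

Reviewing the printed commands before running any of them by hand keeps the change deliberate, which matters when a bad setting can stop the host from booting.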

However, I was no longer able to SSH into the DPU OS even after a long wait, and the rshim service returned errors.

As a result, I performed a cold reboot, and now my host OS (Ubuntu 22.04) is stuck on a black screen. It won't boot at all unless I unplug the DPU from the machine.

I tried plugging the DPU into another machine with a fresh Ubuntu installation, and that machine failed to boot as well. The DPU doesn't seem to be dead: I see flashing green lights on the OOB port when I connect it to a switch with an RJ45 cable. However, no DHCP ACK is ever exchanged, which means I can't SSH into it from another server.

I am not sure how to proceed from here. Any guidance would be appreciated!

Hi,

You can use the console to monitor the card during boot; this might help identify any errors. You could also try reflashing the card and then attempting to access it again.

Thanks

Hi, I am sorry, but how would I go about reflashing the card if the host doesn’t boot with the card plugged in? Any guidance would be appreciated!

Can you find out why the host doesn’t boot with the card plugged in? This has happened to me before, and I resolved it by changing the server’s power/energy profile. You might find relevant information in the BIOS or in the monitoring logs.

Let me know.

Are you using a server? If so, which one? Knowing the model may help with configuring it correctly.

I have a Dell Precision 7865 with Ubuntu 22.04 Server installed. Everything was working fine until I enabled some firmware flags on the DPU so that I could start using SNAP (SNAP Installation - NVIDIA Docs).

A cold reboot later, the host is stuck on a black screen.

I will try both suggestions: changing the server’s power/energy profile and going through the system logs in the BIOS to see if I can identify something.

Thanks for your reply! Much appreciated.

First of all, you need to ensure proper airflow. If the airflow direction is incorrect, the temperature can exceed safe operating limits.

Second, please send me your server BIOS configuration. I need it before uninstalling the card.


This is the only log I could find pertaining to the DPU: "NVMe device not ready"
https://www.delltechnologies.com/asset/en-us/products/workstations/technical-support/precision-7865-tower-technical-guidebook.pdf.external - I have a Precision 7865

I can't tell if it's a host server issue or something related to the DPU and its emulation configuration.

You need to move the card to a different PCIe slot. See https://www.dell.com/support/kbdoc/en-us/000222866/poweredge-r750-not-boot-and-have-an-error-nvmexpress-nvme-device-not-ready