BlueField-2 DPU not coming up

Hi,
We are seeing eMMC issues, and the BFB install is also failing.

Loading Linux 5.4.0-1017.16.gf565efa-bluefield …
Loading initial ramdisk …
[ 4.042778] JBD2: Invalid checksum recovering block 716 in log
[ 4.048645] EXT4-fs (mmcblk0p2): error loading journal

BFB install fails:
Mellanox BlueField-2 A1 BL1 V1.1
ERROR: Failed to load BL2R firmware.

Dear NVIDIA team,

Before the Monday meeting, we wanted to share the latest results from debugging the DPU boot failure. Please review the details below and let us know whether there is any other way to recover the eMMC and boot the DPU.
The likely cause appears to be a corrupted partition on the eMMC, as seen in the GRUB and boot logs. Does NVIDIA have any other debug or recovery BFB images that could recover and boot the DPU?

We can discuss further on Monday.

  1. All DOCA packages are installed and verified (sdk, runtime, tools, host-repo, rshim), and the minicom console works over PCIe rshim (/dev/rshim0/console).
    Host: Ubuntu 20.04.3 LTS (GNU/Linux 5.13.0-30-generic x86_64)
    rshim is active; log below.

     root@navbhat-UCSC-C240-M6L:/home/navbhat/doca/bootimages/bootimages# sudo systemctl status rshim
     ● rshim.service - rshim driver for BlueField SoC
          Loaded: loaded (/lib/systemd/system/rshim.service; enabled; vendor preset: enabled)
          Active: active (running) since Wed 2022-03-02 22:54:37 IST; 17h ago
            Docs: man:rshim(8)
         Process: 2884 ExecStart=/usr/sbin/rshim $OPTIONS (code=exited, status=0/SUCCESS)
        Main PID: 2890 (rshim)
           Tasks: 6 (limit: 308992)
          Memory: 1.4M
          CGroup: /system.slice/rshim.service
                  └─2890 /usr/sbin/rshim
    
     Mar 02 23:20:01 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
     Mar 02 23:22:03 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 attached
     Mar 03 14:17:55 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot open
     Mar 03 14:19:46 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot timeout
     Mar 03 14:19:46 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
     Mar 03 15:33:07 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot open
     Mar 03 15:33:44 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
     Mar 03 16:30:36 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot open
     Mar 03 16:32:27 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot timeout
     Mar 03 16:32:27 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
     root@navbhat-UCSC-C240-M6L:/home/navbhat/doca/bootimages/bootimages#
    
     navbhat@navbhat-UCSC-C240-M6L:~$ ifconfig tmfifo_net0
     tmfifo_net0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
             inet 192.168.100.1  netmask 255.255.255.252  broadcast 192.168.100.3
             inet6 fe80::21a:caff:feff:ff02  prefixlen 64  scopeid 0x20<link>
             ether 00:1a:ca:ff:ff:02  txqueuelen 1000  (Ethernet)
             RX packets 2  bytes 164 (164.0 B)
             RX errors 0  dropped 0  overruns 0  frame 0
             TX packets 38  bytes 6209 (6.2 KB)
             TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    
  2. We can access the UEFI internal shell and the GRUB prompt. Listing the eMMC partitions suggests that the MBR may have been corrupted, per the log below.
    grub> ls -l
    Device proc: Filesystem type procfs - Sector size 512B - Total size 0KiB
    Device hd0: No known filesystem detected - Sector size 512B - Total size 40747008KiB
    Partition hd0,gpt2: Filesystem type ext* - Label `writable' - Last modification time 2021-10-28 05:43:01 Thursday, UUID 2b8bddba-d258-4b86-94a5-2ca16043857b - Partition start at 52224KiB - Total size 40694767.5KiB
    Partition hd0,gpt1: Filesystem type fat - Label `system-boot', UUID 7A48-5C8E - Partition start at 1024KiB - Total size 51200KiB
    grub>

  3. We tried to boot the recovery-mode Ubuntu entry (under "Advanced options for Ubuntu") from GRUB, but it failed with the same eMMC error (log below). Note: a normal Ubuntu boot also failed with the same error (see the earlier mail).
    GNU GRUB version 2.04

        Ubuntu
       *Advanced options for Ubuntu
    
    
    
           Use the ^ and v keys to select which entry is highlighted.
           Press enter to boot the selected OS, `e' to edit the commands
           before booting or `c' for a command-line. ESC to return previous
           menu.
                                  GNU GRUB  version 2.04
    
    
        Ubuntu, with Linux 5.4.0-1017.16.gf565efa-bluefield
       *Ubuntu, with Linux 5.4.0-1017.16.gf565efa-bluefield (recovery mode)
    
    
    
    
           Use the ^ and v keys to select which entry is highlighted.
           Press enter to boot the selected OS, `e' to edit the commands
           before booting or `c' for a command-line. ESC to return previous
           menu.
    
     Loading Linux 5.4.0-1017.16.gf565efa-bluefield ...
     Loading initial ramdisk ...
     [    4.018490] JBD2: Invalid checksum recovering block 716 in log
     [    4.024378] EXT4-fs (mmcblk0p2): error loading journal
    
  4. We tried to install and boot the latest official BFB published on the NVIDIA website; that also fails to boot the DPU (error log below).

     navbhat@navbhat-UCSC-C240-M6L:~$ sudo bfb-install --rshim rshim0 --bfb doca/DOCA_v1.2.1_BlueField_OS_Ubuntu_20.04-5.4.0-1023-bluefield-5.5-2.1.7.0-3.8.5.12027-1.signed-aarch64.bfb -c  doca/bf.cfg
     [sudo] password for navbhat:
     Pushing bfb + cfg
     cat: write error: Connection timed out
     128KiB 0:01:50 [1.16KiB/s] [       <=>       ]
     Failed to push BFB
     navbhat@navbhat-UCSC-C240-M6L:~$
     
      Console
      navbhat@navbhat-UCSC-C240-M6L:~$ sudo cat /dev/rshim0/console 115200
     Mellanox BlueField-2 A1 BL1 V1.1
     ERROR:   Failed to load BL2R firmware.
    
  5. We tried to boot from GRUB by manually specifying the vmlinuz and initrd images from (hd0,gpt2)/boot/, but it fails to boot.
    grub> ls
    (proc) (hd0) (hd0,gpt2) (hd0,gpt1)
    grub> set root=(hd0,gpt2)
    grub> ls /boot
    efi/ initrd.img vmlinuz.old initrd.img.old vmlinuz grub/
    initrd.img-5.4.0-1017.16.gf565efa-bluefield
    config-5.4.0-1017.16.gf565efa-bluefield
    System.map-5.4.0-1017.16.gf565efa-bluefield
    vmlinuz-5.4.0-1017.16.gf565efa-bluefield
    grub> linux /boot/vmlinuz-5.4.0-1017.16.gf565efa-bluefield
    grub> initrd /boot/initrd.img-5.4.0-1017.16.gf565efa-bluefield
    grub> boot

Thanks
Navin
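
The rshim journal in item 1 can be scanned to see which BFB push attempts ended in a timeout and how long each one lasted. The following is a minimal Python sketch, assuming the journal lines have exactly the shape shown in that excerpt. The function name `boot_attempts` and the `year` parameter are illustrative (the journal timestamps carry no year), not part of rshim.

```python
import re
from datetime import datetime

# Assumed line shape, copied from the journal excerpt in item 1:
#   "Mar 03 14:17:55 <host> rshim[<pid>]: rshim0 boot open"
LINE_RE = re.compile(
    r"(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ rshim\[\d+\]: "
    r"(?P<dev>rshim\d+) boot (?P<event>open|timeout|close)"
)

def boot_attempts(lines, year=2022):
    """Pair each 'boot open' with the next 'boot close' on the same device."""
    attempts, current = [], {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # e.g. "rshim0 attached" is not a boot event
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        dev, event = m["dev"], m["event"]
        if event == "open":
            current[dev] = {"dev": dev, "start": ts, "timed_out": False}
        elif dev in current:
            if event == "timeout":
                current[dev]["timed_out"] = True
            else:  # "close" ends the attempt
                attempt = current.pop(dev)
                attempt["seconds"] = int((ts - attempt["start"]).total_seconds())
                attempts.append(attempt)
    return attempts

# Journal lines taken from item 1 of this mail.
SAMPLE = """\
Mar 02 23:20:01 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
Mar 02 23:22:03 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 attached
Mar 03 14:17:55 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot open
Mar 03 14:19:46 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot timeout
Mar 03 14:19:46 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
Mar 03 15:33:07 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot open
Mar 03 15:33:44 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
Mar 03 16:30:36 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot open
Mar 03 16:32:27 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot timeout
Mar 03 16:32:27 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
"""

for a in boot_attempts(SAMPLE.splitlines()):
    status = "timeout" if a["timed_out"] else "closed"
    print(a["dev"], a["start"], f"{a['seconds']}s", status)
```

A "close" with no preceding "open" (the first line of the excerpt) is skipped, since its start time is unknown.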

Whenever you see the message below, it usually means that secure boot in the ATF failed. BL1, the root of trust on the chip, failed to validate BL2R, so it could not load it.

Mellanox BlueField-2 A1 BL1 V1.1
ERROR: Failed to load BL2R firmware.

This particular card was an early sample with secure boot enabled and development keys installed. The signed image available for download on nvidia.com is signed with a different set of keys. An unsigned BFB was provided.
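
For triaging captured console output, the messages quoted in this thread can be matched against a small lookup table. This is a rough Python sketch: the `SIGNATURES` table and the `classify` helper are illustrative names, and the notes only restate what this thread reports. It is not an official NVIDIA error catalogue.

```python
# Substring seen in this thread -> what the thread says it indicates.
SIGNATURES = [
    ("Failed to load BL2R firmware",
     "BL1 could not validate BL2R (secure boot); here the BFB was signed "
     "with keys that differ from the ones installed on the card"),
    ("JBD2: Invalid checksum",
     "ext4 journal recovery hit a block with a bad checksum"),
    ("error loading journal",
     "the ext4 journal on the named partition could not be loaded"),
    ("Failed to push BFB",
     "bfb-install could not write the image through rshim"),
]

def classify(text):
    """Return (signature, note) pairs found in text, each once, in first-seen order."""
    found = []
    for line in text.splitlines():
        for needle, note in SIGNATURES:
            if needle in line and (needle, note) not in found:
                found.append((needle, note))
    return found

# Console lines copied from the logs earlier in this thread.
CONSOLE = """\
Mellanox BlueField-2 A1 BL1 V1.1
ERROR:   Failed to load BL2R firmware.
[    4.018490] JBD2: Invalid checksum recovering block 716 in log
[    4.024378] EXT4-fs (mmcblk0p2): error loading journal
"""

for needle, note in classify(CONSOLE):
    print(f"{needle}: {note}")
```

Plain substring matching is used so that differences in spacing after "ERROR:" or in kernel timestamps do not affect the result.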