Ahead of Monday's meeting, here is the latest on our debugging attempts to boot the DPU. Please review the details below and suggest any other options for recovering the eMMC and booting the DPU.
The likely cause appears to be a corrupted partition on the eMMC, as seen in the GRUB and boot logs. Does NVIDIA have any other debug or recovery BFB images that could be used to recover and boot the DPU?
We can discuss this further on Monday.
We have all DOCA packages installed and verified (sdk, runtime, tools, host-repo, rshim), and the minicom console works over PCIe rshim (/dev/rshim0/console).
Host: Ubuntu 20.04.3 LTS (GNU/Linux 5.13.0-30-generic x86_64)
The rshim service is active; its status and log are below.
root@navbhat-UCSC-C240-M6L:/home/navbhat/doca/bootimages/bootimages# sudo systemctl status rshim
● rshim.service - rshim driver for BlueField SoC
Loaded: loaded (/lib/systemd/system/rshim.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2022-03-02 22:54:37 IST; 17h ago
Docs: man:rshim(8)
Process: 2884 ExecStart=/usr/sbin/rshim $OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 2890 (rshim)
Tasks: 6 (limit: 308992)
Memory: 1.4M
CGroup: /system.slice/rshim.service
└─2890 /usr/sbin/rshim
Mar 02 23:20:01 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
Mar 02 23:22:03 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 attached
Mar 03 14:17:55 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot open
Mar 03 14:19:46 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot timeout
Mar 03 14:19:46 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
Mar 03 15:33:07 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot open
Mar 03 15:33:44 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
Mar 03 16:30:36 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot open
Mar 03 16:32:27 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot timeout
Mar 03 16:32:27 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
root@navbhat-UCSC-C240-M6L:/home/navbhat/doca/bootimages/bootimages#
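To make the pattern in the rshim journal easier to see, here is a minimal Python sketch that pairs each "boot open" with the following "boot close" and notes whether a "boot timeout" occurred in between. The helper name `boot_attempts` is hypothetical, and the year 2022 is an assumption taken from the "Active: ... since Wed 2022-03-02" line, since the journal lines themselves carry no year.

```python
from datetime import datetime

# Journal lines copied from the rshim status output above.
LOG = """\
Mar 02 23:20:01 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
Mar 02 23:22:03 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 attached
Mar 03 14:17:55 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot open
Mar 03 14:19:46 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot timeout
Mar 03 14:19:46 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
Mar 03 15:33:07 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot open
Mar 03 15:33:44 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
Mar 03 16:30:36 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot open
Mar 03 16:32:27 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot timeout
Mar 03 16:32:27 navbhat-UCSC-C240-M6L rshim[2890]: rshim0 boot close
"""

ASSUMED_YEAR = 2022  # assumption: taken from the "Active: ... since" line


def boot_attempts(log_text):
    """Return (opened_at, seconds_until_close, timed_out) per boot attempt."""
    attempts = []
    opened_at = None      # timestamp of the pending "boot open", if any
    opened_label = None   # original timestamp text of that line
    timed_out = False
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) < 7:
            continue
        label = " ".join(parts[:3])
        stamp = datetime.strptime(f"{ASSUMED_YEAR} {label}", "%Y %b %d %H:%M:%S")
        event = " ".join(parts[6:])  # text after "rshim0"
        if event == "boot open":
            opened_at, opened_label, timed_out = stamp, label, False
        elif event == "boot timeout" and opened_at is not None:
            timed_out = True
        elif event == "boot close" and opened_at is not None:
            seconds = int((stamp - opened_at).total_seconds())
            attempts.append((opened_label, seconds, timed_out))
            opened_at = None
    return attempts


if __name__ == "__main__":
    for opened, seconds, timed_out in boot_attempts(LOG):
        status = "timeout" if timed_out else "closed without timeout"
        print(f"{opened}: {seconds}s, {status}")
```

On the lines above, two of the three attempts end in a timeout after the same interval, while the middle attempt closes without one. What each attempt was pushing is not recorded in the journal, so that would need to be matched against our own notes.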
navbhat@navbhat-UCSC-C240-M6L:~$ ifconfig tmfifo_net0
tmfifo_net0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.100.1 netmask 255.255.255.252 broadcast 192.168.100.3
inet6 fe80::21a:caff:feff:ff02 prefixlen 64 scopeid 0x20<link>
ether 00:1a:ca:ff:ff:02 txqueuelen 1000 (Ethernet)
RX packets 2 bytes 164 (164.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 38 bytes 6209 (6.2 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
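As a quick sanity check on the tmfifo_net0 addressing, the sketch below uses Python's standard `ipaddress` module to work out which addresses the 255.255.255.252 netmask leaves available. The helper name `peer_addresses` is hypothetical. The point is only that a /30 subnet has two usable addresses, so the DPU end of the link would be expected at the one the host is not using.

```python
import ipaddress

# Address and netmask as reported by ifconfig for tmfifo_net0 on the host.
HOST_IF = ipaddress.ip_interface("192.168.100.1/255.255.255.252")


def peer_addresses(iface):
    """Usable addresses in the interface's subnet other than its own."""
    return [addr for addr in iface.network.hosts() if addr != iface.ip]


if __name__ == "__main__":
    print("prefix length:", HOST_IF.network.prefixlen)
    print("broadcast:", HOST_IF.network.broadcast_address)
    print("expected DPU-side address(es):", peer_addresses(HOST_IF))
```

The computed broadcast address matches the ifconfig output. The low RX packet count suggests the DPU side is sending very little, which would be consistent with the OS on the DPU not having come up, though that is an inference rather than something the counters prove.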
We can access the UEFI internal shell and the GRUB prompt. Reading the eMMC partitions suggests that the MBR may have been corrupted, as per the log below.
grub> ls -l
Device proc: Filesystem type procfs - Sector size 512B - Total size 0KiB
Device hd0: No known filesystem detected - Sector size 512B - Total size 40747008KiB
Partition hd0,gpt2: Filesystem type ext* - Label `writable' - Last modification time 2021-10-28 05:43:01 Thursday, UUID 2b8bddba-d258-4b86-94a5-2ca16043857b - Partition start at 52224KiB - Total size 40694767.5KiB
Partition hd0,gpt1: Filesystem type fat - Label `system-boot', UUID 7A48-5C8E - Partition start at 1024KiB - Total size 51200KiB
grub>
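The sizes GRUB reports can be cross-checked with a little arithmetic. The Python sketch below (helper name `layout_summary` is hypothetical) takes the start offsets and sizes from the listing above and checks whether the partitions are contiguous and how many 512-byte sectors remain after the last one. This only tests whether the reported numbers are self-consistent. It does not validate GPT checksums or the filesystem contents.

```python
SECTOR_BYTES = 512       # sector size reported by GRUB
DISK_KIB = 40747008      # total size of hd0 reported by GRUB

# name -> (start in KiB, size in KiB), as reported by "ls -l" in GRUB
PARTITIONS = {
    "gpt1": (1024, 51200),
    "gpt2": (52224, 40694767.5),
}


def layout_summary(disk_kib, partitions):
    """Summarise partition end offsets, contiguity and trailing free sectors."""
    ordered = sorted(partitions.items(), key=lambda item: item[1][0])
    ends = {name: start + size for name, (start, size) in ordered}
    # Each partition should end exactly where the next one starts.
    contiguous = all(
        ends[cur[0]] == nxt[1][0] for cur, nxt in zip(ordered, ordered[1:])
    )
    last_end = ends[ordered[-1][0]]
    tail_sectors = (disk_kib - last_end) * 1024 / SECTOR_BYTES
    return {"ends_kib": ends, "contiguous": contiguous, "tail_sectors": tail_sectors}


if __name__ == "__main__":
    summary = layout_summary(DISK_KIB, PARTITIONS)
    print("partition ends (KiB):", summary["ends_kib"])
    print("contiguous:", summary["contiguous"])
    print("sectors after last partition:", summary["tail_sectors"])
```

With these numbers, gpt1 ends exactly where gpt2 starts, and 33 sectors remain after gpt2. That is the space a backup GPT with the standard 128 entries occupies at this sector size (32 sectors of entries plus one header sector), so the reported layout looks self-consistent. If that holds, the damage may be inside the ext4 filesystem on gpt2 rather than in the partition table itself, but that is a hypothesis to confirm, not a conclusion.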
We tried to boot Ubuntu in recovery mode (under "Advanced options for Ubuntu") from GRUB, but it failed with the same eMMC error (error log below). Note: a normal Ubuntu boot also failed with the same error (see the earlier mail).
GNU GRUB version 2.04
Ubuntu
*Advanced options for Ubuntu
Use the ^ and v keys to select which entry is highlighted.
Press enter to boot the selected OS, `e' to edit the commands
before booting or `c' for a command-line. ESC to return previous
menu.
GNU GRUB version 2.04
Ubuntu, with Linux 5.4.0-1017.16.gf565efa-bluefield
*Ubuntu, with Linux 5.4.0-1017.16.gf565efa-bluefield (recovery mode)
Use the ^ and v keys to select which entry is highlighted.
Press enter to boot the selected OS, `e' to edit the commands
before booting or `c' for a command-line. ESC to return previous
menu.
Loading Linux 5.4.0-1017.16.gf565efa-bluefield ...
Loading initial ramdisk ...
[ 4.018490] JBD2: Invalid checksum recovering block 716 in log
[ 4.024378] EXT4-fs (mmcblk0p2): error loading journal
We also tried installing and booting the latest official BFB published on the NVIDIA website; that also fails to boot the DPU (error log below).
We tried to boot from GRUB by manually specifying the vmlinuz and initrd images from (hd0,gpt2)/boot/, but that fails to boot as well.
grub> ls
(proc) (hd0) (hd0,gpt2) (hd0,gpt1)
grub> set root=(hd0,gpt2)
grub> ls /boot
efi/ initrd.img vmlinuz.old initrd.img.old vmlinuz grub/
initrd.img-5.4.0-1017.16.gf565efa-bluefield
config-5.4.0-1017.16.gf565efa-bluefield
System.map-5.4.0-1017.16.gf565efa-bluefield
vmlinuz-5.4.0-1017.16.gf565efa-bluefield
grub> linux /boot/vmlinuz-5.4.0-1017.16.gf565efa-bluefield
grub> initrd /boot/initrd.img-5.4.0-1017.16.gf565efa-bluefield
grub> boot
Whenever you see this, it usually means secure boot in the ATF failed. BL1, which is the root of trust on the chip, failed to validate BL2R, so it could not load it.
This particular card was an early sample card with secure boot enabled and the development keys installed. The signed image available for download on nvidia.com is signed with a different set of keys. An unsigned BFB was provided