Failed to boot from certain NVMe drives with R35.5.0

Hi,

We are testing our customized board with both customized R35.5.0 (Jetpack 5.1.3) BSP and R35.4.1 (Jetpack 5.1.2) BSP.

It was observed that certain NVMe drives (Innodisk 4TG2-P) were unable to boot into the Ubuntu OS with R35.5.0, an issue not encountered with R35.4.1.

We tested a total of five NVMe drives (three Innodisk 4TG2-P and two WD SN550) and found that two of the Innodisk 4TG2-P drives exhibited this issue.

From the following experiments, it is evident that the issue is related to differences in the bootloader:
NOTE1: The problematic NVMe will be referred to as “BAD NVMe”.
NOTE2: In both test cases, the BAD NVMe holds the same R35.5.0 rootfs; only the QSPI bootloader contents differ.

[Case 1] R35.5.0 QSPI + BAD NVMe
Result:
Unable to boot into the Ubuntu OS, although the NVMe boot partition is recognized by UEFI (the NVMe appears in the boot menu).

Test Steps:

  1. Perform a full flash with R35.5.0 using the following command:
sudo ./tools/kernel_flash/l4t_initrd_flash_pw.sh --external-device nvme0n1p1 \
  -c tools/kernel_flash/flash_l4t_external.xml -p "-c bootloader/t186ref/cfg/flash_t234_qspi.xml" \
  --showlogs --network usb0 pjai-100onox internal
  2. Power on the board

[Case 2] R35.4.1 QSPI + BAD NVMe
Result:
Boots successfully into Ubuntu OS.

Test Steps:

  1. Perform a full flash with R35.5.0 using the following command:
sudo ./tools/kernel_flash/l4t_initrd_flash_pw.sh --external-device nvme0n1p1 \
  -c tools/kernel_flash/flash_l4t_external.xml -p "-c bootloader/t186ref/cfg/flash_t234_qspi.xml" \
  --showlogs --network usb0 pjai-100onox internal
  2. Flash only the QSPI with R35.4.1 using the following command:
sudo ./flash.sh -k A_cpu-bootloader -c bootloader/t186ref/cfg/flash_t234_qspi.xml pjai-100onox nvme0n1p1
  3. Power on the board

Here are the boot logs for both test cases:
[Case 1] fail_boot_l4t_35_5_0_ox8g_inno_verbose.log (393.3 KB)
[Case 2] normal_boot_l4t_35_4_1_ox8g_inno_verbose.log (432.4 KB)

Other boot logs for working NVMe:
[Innodisk 4TG2-P]normal_boot_l4t_35_5_0_ox8g_inno_verbose.log (456.7 KB)
[WD SN550] normal_boot_l4t_35_5_0_ox8g_wd_verbose.log (456.0 KB)

Could you please help to check this issue? Thank you.

Could you explain what this section is trying to say?
What does "bootloader difference" mean? I also don't quite understand NOTE2.

Hi,

We have had previous reports of this Innodisk 4TG2-P having problems on Jetson.
If only this particular brand exhibits the issue, the SSD firmware is most likely at fault.
We suggest checking whether the firmware version on the two problematic drives differs from the one on the drive that works.
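For example, one quick way to read the firmware revision off each drive (assuming nvme-cli or smartmontools is installed; the /dev/nvme0 node below is only illustrative):

# Lists every NVMe controller with model, serial number and firmware revision
sudo nvme list
# Or query a single drive; the "fr" field of the identify data is the firmware revision
sudo nvme id-ctrl /dev/nvme0 | grep -i "^fr "
# smartmontools alternative: "Firmware Version" is printed in the identity section
sudo smartctl -i /dev/nvme0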

Hi DaveYYY,

  1. "Bootloader difference" refers to the difference between the bootloader binaries stored in QSPI for R35.4.1 and R35.5.0; we are not yet sure whether it is MB1, MB2, or UEFI that affects NVMe boot (see the sketch after this list for one way to start comparing the shipped binaries).
  2. The BAD NVMe is a drive that cannot boot into the Ubuntu OS with R35.5.0; if the QSPI flash on the module is reverted to R35.4.1 (without re-flashing the NVMe), it boots into the Ubuntu OS normally.
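As a starting point for narrowing down the bootloader difference, hashing the UEFI payload shipped with the two BSPs would at least show whether UEFI itself changed (a rough sketch; the r35.4.1/ and r35.5.0/ directory names are illustrative, and the stock BSP layout ships the binary as Linux_for_Tegra/bootloader/uefi_jetson.bin):

# Compare the UEFI binaries packaged with the two releases
sha256sum r35.4.1/Linux_for_Tegra/bootloader/uefi_jetson.bin \
          r35.5.0/Linux_for_Tegra/bootloader/uefi_jetson.bin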

What seems strange is that the same NVMe boots with R35.4.1 but fails with R35.5.0. If this were a compatibility problem, we would expect it to fail on both versions.

I will also compare the firmware versions of these Innodisk drives. Thank you.

You could try booting 35.4.1 from another device (a USB drive, for example), plug in that SSD, and run some stress tests on it, along the lines of the sketch below. A previous customer reported hitting IO errors this way on 35.4.1 / JetPack 5.1.2 as well.
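A minimal sketch of such a stress test, assuming fio is installed and the suspect SSD is mounted at /mnt/nvme_test (an illustrative mount point):

# Mixed random read/write load against a test file on the suspect SSD
sudo fio --name=ssd-stress --directory=/mnt/nvme_test --size=4G \
  --rw=randrw --bs=4k --iodepth=32 --numjobs=4 \
  --ioengine=libaio --direct=1 --time_based --runtime=600 --group_reporting

# In another terminal, watch the kernel log for NVMe/IO errors while the test runs
sudo dmesg -wT | grep -iE "nvme|i/o error"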

Booting without issues may simply be a coincidence, or the bootloader difference may happen to avoid the IO operations that would trigger the problem.

Understood. I will run some stress tests to confirm stability.

We have traced the issue to the contents of the /dev/nvme0n1p10 (RECROOTFS) partition.

On some NVMe drives, after flashing, nvme0n1p10 gets mounted by UEFI and the BOOTAA64.efi inside it is executed (the correct BOOTAA64.efi is located in nvme0n1p11). If the contents of nvme0n1p10 are cleared manually, the system boots into the OS normally.

We are still investigating why some NVMe drives also end up with a BOOTAA64.efi in nvme0n1p10, and why R35.4.1 can still boot into the OS normally with these drives.
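For reference, the manual check and cleanup looked roughly like this (a sketch; it assumes the drive enumerates as nvme0n1 and that the stray loader sits under the usual EFI/BOOT/ path):

# Check whether the RECROOTFS partition carries a stray EFI loader
sudo mount /dev/nvme0n1p10 /mnt
ls -l /mnt/EFI/BOOT/        # a stray BOOTAA64.efi would show up here
sudo umount /mnt

# Clear the partition contents (destructive, but only for nvme0n1p10)
sudo dd if=/dev/zero of=/dev/nvme0n1p10 bs=1M status=progress
sync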

Is this still an issue that needs support? Is there any result you can share?

I modified “l4t_flash_from_kernel.sh” to clean the unused partitions (the ones with no image specified) during flashing.

--- l4t_flash_from_kernel.sh	2024-02-20 12:38:14.252590000 +0800
+++ l4t_flash_from_kernel_new.sh	2024-06-05 09:14:04.872801000 +0800
@@ -734,6 +734,7 @@
 	local partition
 	local start_sector
 	local disk
+	local partition_blk_size
 	device_type=$(echo "${item}" | cut -d, -f 2 | sed 's/^ //g' - | cut -d: -f 1)
 	part_name=$(echo "${item}" | cut -d, -f 2 | sed 's/^ //g' - | cut -d: -f 3)
 	file_name=$(echo "${item}" | cut -d, -f 5 | sed 's/^ //g' -)
@@ -744,8 +745,30 @@
 	local res=0
 
 	if [ -z "${file_name}" ];then
-		print_log "Warning: skip writing ${part_name} partition as no image \
-is specified"
+		if [ "${device_type}" = "${EXTERNAL_STORAGE_DEVICE}" ]; then
+			print_log "Warning: No image is specified for ${part_name} partition"
+			print_log "Warning: Try to clean this partition"
+			if [ -n "${count}" ]  && [ "${count}" -ne 0 ]; then
+				partition=$(get_partition "${external_device}" "${count}")
+				echo "Get size of partition through connection."
+				# For host mode, the connection might get reset. Therefore, if it fails,
+			# need to do this to wait until the connection is reestablished
+				wait_for_block_dev "${partition}"
+				pblksz=$(blockdev --getpbsz "/dev/${partition}")
+				chkerr "Get size of partition failed"
+				disk="$(get_disk_name "${external_device}")"
+				start_sector=$(cat "/sys/block/${disk}/${partition}/start")
+				partition_blk_size=$(cat "/sys/block/${disk}/${partition}/size")
+				chkerr "Get start sector of partition failed"
+			fi
+			echo "dd if=/dev/zero of=/dev/${disk} seek=${start_sector} bs=512 count=${partition_blk_size}"
+			dd if=/dev/zero of=/dev/${disk} seek=${start_sector} bs=512 count=${partition_blk_size}
+			res="${?}"
+			echo "Clean ${part_name} partition done"
+			return "${res}"
+		else
+			print_log "Warning: skip writing ${part_name} partition as no image is specified"
+		fi
 		return 0
 	fi
 

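After flashing with the modified script, a quick sanity check (again assuming the drive shows up as nvme0n1) is to confirm that RECROOTFS really ends up empty:

# A zeroed partition carries no filesystem signature, so blkid prints nothing
sudo blkid /dev/nvme0n1p10
# Spot-check that the first MiB of the partition is all zeros
sudo cmp -n 1048576 /dev/zero /dev/nvme0n1p10 && echo "first 1 MiB is zeroed"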