Xavier NX intermittent failure to mount rootfs from SD card

Hello,
We have JetPack 5.1.1 and JetPack 5.1.3 images that work fine on eMMC.
Then I run flash.sh for the SD card version, which is based off the P3509 board .config with SD card.
Next, I edit /boot/extlinux/extlinux.conf to set root to mmcblk1p1.
When I reboot, it works as desired.
However, subsequent reboots (sudo reboot now, sudo shutdown now -r) will sometimes fail to boot with this message:

UEFI attempts direct boot…
I/TC: Reserved shared memory is disabled
I/TC: Dynamic shared memory is enabled
I/TC: Normal World virtualization support is disabled
I/TC: Asynchronous notifications are disabled
[ 2.175148] Key type dns_resolver registered
[ 2.175876] Loading compiled-in X.509 certificates
[ 2.199472] Loaded X.509 cert ‘Build time autogenerated kernel key: f06dcb5f3ca1ae2974fb7cc6f14cffd536f9fdbf’
[ 5.641808] nvme nvme0: Removing after probe failure status: -19
[ 5.746688] tegra_cec 3960000.tegra_cec: tegra_cec_init started
[ 5.746976] tegra_cec 3960000.tegra_cec: probed
[ 5.750252] tegradc 15200000.display: hdmi: can’t get adpater for ddc bus 3
WARNING: clock_disable: clk_power_ungate on gated domain 27 for gpcclk
[ 5.899735] Root device found: mmcblk1p1
[ 5.996288] mmc0: host does not support reading read-only switch, assuming write-enable
[ 6.751853] tegra_cec 3960000.tegra_cec: Can’t find physical address.
[ 6.752057] tegra_cec 3960000.tegra_cec: tegra_cec_init Done.
[ 16.267137] ERROR: mmcblk1p1 not found

Full log of two boots, back-to-back failure attached
twoSDBootFails.log (33.4 KB)

This is intermittent, and it seems to happen about half the time.

I can’t find any other threads with similar issues, so I am asking if anyone has seen intermittent SD card failures.
The SD card we are booting from is in the microSD slot on the Jetson p3668-0000 SOM itself, not on the carrier board.

We need to support customers who use the microSD Jetson. Can anyone recommend a certain brand or feature of SD card that may be more reliable? We are currently using SanDisk Ultra cards.

If the SD card slot is on the SOM, then it will be mmcblk0p1 after boot-up, not mmcblk1p1.

Wayne,
Your suggestion is not the issue on our board. Our DTB enables sdhci@3440000 for a microSD slot on our carrier board.

When we boot up on eMMC Jetson, mmcblk0p1 is on the eMMC.
When we boot up on uSD Jetson, mmcblk1p1 is on the SOM uSD slot, and mmcblk0p1 is on the carrier board uSD slot.
I account for that by editing /boot/extlinux/extlinux.conf to set rootfs=mmcblk1p1, and our board boots fine most of the time.
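
For reference, the extlinux.conf edit described above can be scripted. This is a minimal sketch, assuming GNU sed and an APPEND line that already carries a root= argument; the function name set_root_device and the sample file contents are made up for illustration and are not part of any NVIDIA tooling.

```shell
#!/bin/sh
# set_root_device CONF DEVICE
# Rewrites the root= argument on every APPEND line of an
# extlinux.conf-style file. Hypothetical helper, not NVIDIA tooling.
set_root_device() {
    conf="$1"
    dev="$2"
    # Keep a backup so a bad edit can be undone from a recovery shell.
    cp "$conf" "$conf.bak"
    # Only touch APPEND lines; replace the existing root= token in place.
    sed -i -E "/^[[:space:]]*APPEND/ s#root=[^[:space:]]+#root=${dev}#" "$conf"
}

# Demo on a throwaway file (the contents below are a made-up sample).
demo_conf="$(mktemp)"
cat > "$demo_conf" <<'EOF'
TIMEOUT 30
DEFAULT primary
LABEL primary
      LINUX /boot/Image
      APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait
EOF
set_root_device "$demo_conf" /dev/mmcblk1p1
grep APPEND "$demo_conf"
```

On the target this would be run with sudo against /boot/extlinux/extlinux.conf. Note that it only changes which device name the kernel command line asks for; it does not influence which index the kernel assigns to each controller.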

So, as you can see, that is not the issue. As I already stated in my question, the problem is intermittent failure in bold text, where the partition is found, but then later is not found.
Note how the boot process does find the partition mmcblk1p1, but then later fails to find it.
That is what I need help with. What could cause it to lose the partition on the SOM uSD slot? Rebooting usually solves this issue, but it is intermittent, so we need to find out what is going on and fix it. Is it some timing or frequency issue maybe?

Hi,

Several things in your comment are not clear to us. What is the exact board you are using here?

From your comment:

Then I run flash.sh for the SD card version, which is based off the P3509 board .config with SD card.

If this is your custom board, then why are you using the p3509 board config with SD card? The P3509 board config does not enable sdmmc3 at all.

Also, this does not sound “intermittent” at all. It sounds like an entirely predictable thing to me.

When we boot up on eMMC Jetson, mmcblk0p1 is on the eMMC.

Ok. Great.

When we boot up on uSD Jetson, mmcblk1p1 is on the SOM uSD slot, and mmcblk0p1 is on the carrier board uSD slot.

Then why are you using the same device tree for a different SOM and carrier board?

When your SOM or carrier board changes, you need to update the device tree too. You cannot use one device tree forever. Even JetPack 5 uses different device trees for the different kinds of XNX modules…

We have a custom carrier board. I used the p3509 dts files as a baseline, then enabled the other sdhci for our carrier board SD slot.
I use the p3668 device trees for eMMC and SD jetsons, like p3509 files do: one for carrier and p3668-0000 and one for carrier with p3668-0001.
After flashing for carrier+jetson(with sd card) as mmcblk0p1, I have to edit extlinux.conf, then the board boots and sees rootfs on mmcblk1p1, because it uses our device tree.

However, some subsequent reboots have the error in my OP, only sometimes.
Can you point me to any reasons that the SD on Jetson p3668 would be flaky?

You should share the full logs.

Also, this is not a difficult problem. The first node that gets enabled will be mmcblk0p1. The second one will be mmcblk1p1.

The device tree order affects that. Or you could even just hard-code it in the device tree. For example, make your on-carrier-board SD slot mmcblk3.

Wayne,
Could you please give an example of how I would hard-code that in the device tree? The SOM SD card is enabled by compiling the DTB with the file “tegra194-p3668-0000--sd.dts”. That file is based on “tegra194-p3668-0000-p3509-0000.dts”.
Would I change anything there to make the SOM SD be mmcblk0?

Here is my devicetree node for the carrier board SD card slot:

	carrier_sd: sdhci@3440000 {
		power-gpios = <&tegra_aon_gpio TEGRA194_AON_GPIO(AA, 2) GPIO_ACTIVE_HIGH>; // PAA.02 is pin 145, Carrier board SD card power enable. Drive high to enable power.
		cd-gpios = <&tegra_main_gpio TEGRA194_MAIN_GPIO(S, 4) GPIO_ACTIVE_HIGH>; // S.04/pin211 is high when inserted.
		nvidia,sd-device; // Tell it that it's an SD card, not eMMC.
		label = "carrier_SD";
		status = "okay";
	};

What lines should I add to my sdhci node to make the SOM SD be mmcblk0p1 and the carrier SD be something else? That would be convenient.
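
One way this kind of hard-coding is commonly done on mainline kernels from 5.10 onward is an aliases node with mmcN entries, which the MMC core uses as the host index. The sketch below writes such a fragment to a file. It rests on several assumptions that would need checking against the actual BSP: that the L4T kernel in JetPack 5.1.x honors mmcN aliases, that the SOM slot's controller carries the label sdmmc1 in NVIDIA's dtsi files (the label is a guess based on the controller names used in this thread), and that no conflicting aliases are already defined. The file name mmc-aliases.dtsi and the function name are made up; carrier_sd is the label from the node above.

```shell
#!/bin/sh
# write_mmc_aliases FILE
# Writes a device tree fragment that pins MMC host indices via aliases.
# Hypothetical helper; the labels inside are assumptions (see lead-in).
write_mmc_aliases() {
    cat > "$1" <<'EOF'
/ {
	aliases {
		/* SOM microSD controller (label assumed) -> mmc0 / mmcblk0 */
		mmc0 = &sdmmc1;
		/* Carrier board slot, label from the node above -> mmc3 / mmcblk3 */
		mmc3 = &carrier_sd;
	};
};
EOF
}

# Demo: write the fragment into a scratch directory and print it.
out_dir="$(mktemp -d)"
write_mmc_aliases "$out_dir/mmc-aliases.dtsi"
cat "$out_dir/mmc-aliases.dtsi"
```

The fragment would be included from the board .dts after the file that defines the carrier_sd label. If the aliases take effect, root could stay at mmcblk0p1 on the SD module regardless of probe order.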

Ok, attached is a log of two failures in a row, and another log of a successful boot.
successBoot.log (18.8 KB)
twoSDBootFails.log (33.4 KB)

Did you change the kernel log level? I need the full dmesg, but it looks like your UART log does not contain all of it.

Not that I know of. What would I change to change the log level?

I don’t know what happened on your side. It would be better to figure that out yourself.

For example, you could look at what other forum users’ UART logs look like… the kernel part is far longer than yours.

Would it be one of these, in the .config?
image

You could add ignore_loglevel to your kernel cmdline and see if it prints more logs…
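
A minimal sketch of that change, assuming the kernel arguments live on the APPEND line of /boot/extlinux/extlinux.conf and GNU sed is available. The helper name add_kernel_arg and the sample file are made up for illustration; the argument is matched as a plain token, so it should not contain regex metacharacters.

```shell
#!/bin/sh
# add_kernel_arg CONF ARG
# Appends ARG to every APPEND line of an extlinux.conf-style file,
# unless an APPEND line already contains it. Hypothetical helper.
add_kernel_arg() {
    conf="$1"
    arg="$2"
    # Do nothing if the argument is already present as a whole token.
    if grep -Eq "^[[:space:]]*APPEND.*[[:space:]]${arg}([[:space:]]|\$)" "$conf"; then
        return 0
    fi
    # Replace any trailing whitespace on APPEND lines with " ARG".
    sed -i -E "/^[[:space:]]*APPEND/ s/[[:space:]]*\$/ ${arg}/" "$conf"
}

# Demo on a throwaway file (made-up sample contents).
log_conf="$(mktemp)"
cat > "$log_conf" <<'EOF'
LABEL primary
      LINUX /boot/Image
      APPEND ${cbootargs} root=/dev/mmcblk1p1 rw rootwait
EOF
add_kernel_arg "$log_conf" ignore_loglevel
add_kernel_arg "$log_conf" ignore_loglevel   # second call is a no-op
grep APPEND "$log_conf"
```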

ok, here is a successful boot, after I appended ignore_loglevel to the args in extlinux.conf:
successLog_ignoreLoglevel.log (95.1 KB)

I see it is bigger than the first ones, good.

EDIT, I got a failure to boot, after rebooting 3 times!
Here is the log:
fail_ignoreLoglevel.log (88.1 KB)
I am reading it now. This is the first log with this much verbosity that I have seen. Let me know what you find!

Hi,

Just to clarify: I still don’t know what your exact purpose is here.

Could you tell us what is the exact “mmcblk1p1” you want to mount when it is on sdcard SOM and when it is on emmc SOM?

For example, the “mmcblk1p1” in the log you just shared comes from sdmmc1.
But I am not sure if this is really what you want. Could you confirm that first? Or do you actually want it to be sdmmc3?

I think the problem should be defined first.
The reason why mmcblk1p1 is gone can easily be found once you provide the full log.

My purpose is in bold text in my original post. Sometimes, the boot will hang, saying
[ 5.899735] Root device found: mmcblk1p1 (that is normal)
then, many seconds later, hang
[ 16.267137] ERROR: mmcblk1p1 not found (this is intermittent, and boot hangs)

Hi,

Err… please stop giving me the same answer.
I can see the bold text, but that is not really important.

Read my question first.

If your position is that you don’t care what mmcblk1p1 actually is and you just want to mount it, then please say so. It does not sound like a good idea, but if you don’t care then I don’t care. Is that okay?

If you don’t understand my question, then please say so too.

Ok, it looks like the SD card is getting re-mounted as mmc0, is that correct?

[ 10.751820] mmc0: new ultra high speed SDR50 SDHC card at address e624
[ 10.752568] mmcblk0: mmc0:e624 SU32G 29.7 GiB
[ 10.752895] tegra_cec 3960000.tegra_cec: physical address: 10:00.
[ 10.759791] extcon-disp-state external-connection:disp-state: cable 53 state 1
[ 10.767097] Extcon HDMI: HPD enabled
[ 10.771033] tegradc 15200000.display: hdmi: plugged
[ 10.772979] mmcblk0: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13 p14 p15 p16 p17 p18 p19 p20 p21 p22
[ 10.817435] tegra_cec 3960000.tegra_cec: Sent res: -113.
[ 10.817689] tegra_cec 3960000.tegra_cec: tegra_cec_init Done.
[ 10.821144] usb 1-3.1: New USB device found, idVendor=0424, idProduct=ec00, bcdDevice= 2.00
[ 10.821417] usb 1-3.1: New USB device strings: Mfr=0, Product=0, SerialNumber=0

How do I keep it from moving the SD card from mmcblk1 to mmcblk0?
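
To compare a good boot against a bad one, it may help to pull the block-device-to-host mapping out of each log. A minimal sketch, assuming GNU sed and kernel lines in the same format as the excerpt above; the function name mmc_map is made up, and the demo input is copied from that excerpt.

```shell
#!/bin/sh
# mmc_map LOGFILE
# Prints "<block device> <mmc host> <card name>" for each card
# announcement line, e.g.:
#   [ 10.752568] mmcblk0: mmc0:e624 SU32G 29.7 GiB
# Partition lists ("mmcblk0: p1 p2 ...") do not match and are skipped.
mmc_map() {
    sed -n -E 's/.*(mmcblk[0-9]+): (mmc[0-9]+):[0-9a-f]+ ([^ ]+) .*/\1 \2 \3/p' "$1"
}

# Demo using lines from the failing boot log quoted above.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
[ 10.751820] mmc0: new ultra high speed SDR50 SDHC card at address e624
[ 10.752568] mmcblk0: mmc0:e624 SU32G 29.7 GiB
[ 10.772979] mmcblk0: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13 p14 p15 p16 p17 p18 p19 p20 p21 p22
EOF
mmc_map "$sample_log"
# prints: mmcblk0 mmc0 SU32G
```

Running it over successLog_ignoreLoglevel.log and fail_ignoreLoglevel.log side by side should show whether the same card lands on a different host index between the two boots.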