Flash.sh (and tegraflash.py) fails to erase QSPI on non-SDK modules (P3767 Orin Nano 8GB)

All of this was done using L4T 35.4.1.

The tegraflash tooling fails to erase QSPI on all P3767 modules I’m testing with, except for the ones that came with the Jetson Orin Nano devkit.

The same problem happens both when using the devkit carrier and my own carrier board.

I’m running:

NO_ROOTFS=1 NO_RECOVERY_IMG=1 ./flash.sh jetson-orin-nano-devkit-nvme external

This works fine the first time with a pristine jetson module straight out of the ESD bag.
The system boots all the way to the UEFI bootloader (see jetson_console1.log)

Then I assert force_recovery and do the exact same thing again.

Now MB2 fails to erase the QSPI (or maybe somehow performs only a partial erase?) and leaves the board in a non-working state where it just keeps rebooting (see jetson_console2.log).

It’s also evident on the host: the erase pass takes ~1 second, where it usually takes around 1-2 minutes.

[   8.8826 ] Start flashing
[   8.8830 ] tegradevflash_v2 --pt flash.xml.bin --create
[   8.8833 ] Bootloader version 01.00.0000
[   8.9948 ] Erasing spi: 0 ......... [Done]
[  10.0021 ] Writing partition secondary_gpt with gpt_secondary_3_0.bin [ 16896 bytes ]
[  10.0024 ] [................................................] 100%
[  10.0057 ] 000000004d4d2c01: E> NV3P_SERVER: Failed to initialize partition table from GPT.
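As a rough cross-check of the "~1 second" figure, the erase time can be estimated from the timestamps in the excerpt above. This is only a sketch: it assumes the timestamp of the line following "Erasing spi" approximates when the erase finished, and the 30-second threshold is an arbitrary value chosen here, not something from the tooling.

```python
import re

# Excerpt of the host log quoted above.
LOG = """\
[   8.8826 ] Start flashing
[   8.8830 ] tegradevflash_v2 --pt flash.xml.bin --create
[   8.8833 ] Bootloader version 01.00.0000
[   8.9948 ] Erasing spi: 0 ......... [Done]
[  10.0021 ] Writing partition secondary_gpt with gpt_secondary_3_0.bin [ 16896 bytes ]
"""

# Matches "[  <seconds> ] <message>".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\s*\]\s*(.*)$")


def erase_duration(log_text):
    """Seconds between the 'Erasing spi' line and the next timestamped line, or None."""
    entries = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            entries.append((float(m.group(1)), m.group(2)))
    for i, (ts, msg) in enumerate(entries[:-1]):
        if msg.startswith("Erasing spi"):
            return entries[i + 1][0] - ts
    return None


# Assumption: a genuine erase takes 1-2 minutes (per the post), so anything
# under this arbitrary threshold is treated as suspicious.
MIN_PLAUSIBLE_SECONDS = 30.0

duration = erase_duration(LOG)
print(f"erase took {duration:.2f} s, suspicious: {duration < MIN_PLAUSIBLE_SECONDS}")
```

The same function could be pointed at the full host.txt to compare a working run against a failing one.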

Full log in host.txt

I’ve tested on two modules, same exact result.

host.txt (115.1 KB)
jetson_console.log (51.2 KB)
jetson_console2.log (95.7 KB)

Not sure if I’m missing something that needs to be done for non-SDK SOM units?

Hi,

Please try initrd flash instead:

sudo ./tools/kernel_flash/l4t_initrd_flash.sh -p "--no-systemimg -c bootloader/t186ref/cfg/flash_t234_qspi.xml" --network usb0 jetson-orin-nano-devkit nvme0n1p1

We have hit this issue before, and it looks like it only happens on modules that have not previously been flashed by any other means, but we have not been able to allocate resources to solving it.

Ok, I think the reason for these problems is that the QSPI flash has block protection set.

Once the module boots into Linux, this gets cleared automatically by qspi_init() (some of the console output is cut for brevity):

[   52.288371] qspi_mtd spi6.0: MX25U51279G (65536 Kbytes)
[   52.288380] qspi_mtd spi6.0: mtd .name = spi6.0, .size = 0x4000000 (64MiB) .erasesize = 0x00010000 (64KiB) .numeraseregions = 0
[   52.288622] qspi_mtd spi6.0: block protection enabled 68
[   52.288684] qspi_mtd spi6.0: clearing block protect
[   52.298314] 1 fixed-partitions partitions found on MTD device spi6.0
[   52.298316] Creating 1 MTD partitions on "spi6.0":
[   52.298320] 0x000000000000-0x000004000000 : "Whole_flash0"

According to the MX25U51245G datasheet, status register block protection bits of (0x68 & 0x3c) >> 2 == 0b1010 mean that half of the flash (i.e. 32MB) is write protected. Depending on the T/B bit, it’s either the first 512 blocks or the last 512 blocks. If T/B is 0 (default), it would be the last 512 blocks.
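The arithmetic above can be written out as a small decoder. This is a sketch under assumptions: it treats the "68" printed by the kernel as hexadecimal (as the post does), takes BP3..BP0 to be status register bits 5..2, and assumes a non-zero BP value n protects 2^(n-1) blocks of 64 KiB, capped at the whole chip. That mapping matches the post's reading of the datasheet but should be checked against the datasheet for the actual part.

```python
BP_MASK = 0x3C   # BP3..BP0 in status register bits 5..2 (assumed layout)
BP_SHIFT = 2


def decode_block_protection(status, total_blocks=1024):
    """Return (bp_value, protected_block_count) for a status register byte."""
    bp = (status & BP_MASK) >> BP_SHIFT
    if bp == 0:
        protected = 0
    else:
        # Assumed mapping: each step doubles the protected area, up to the full chip.
        protected = min(1 << (bp - 1), total_blocks)
    return bp, protected


BLOCK_SIZE_KIB = 64  # erase size reported in the kernel log above

bp, blocks = decode_block_protection(0x68)
print(f"BP3..BP0 = {bp:04b}, protected blocks = {blocks}, "
      f"protected size = {blocks * BLOCK_SIZE_KIB // 1024} MiB")
# prints: BP3..BP0 = 1010, protected blocks = 512, protected size = 32 MiB
```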

It’s a bit unclear to me how the MB2 manages to flash these write protected blocks (which it apparently seems to do as it works fine on a totally pristine module). Initially I suspected it flashed on the first 50% of the QSPI and ignored the failure at the last 512 blocks (The error reporting/detection seems to be somewhat lacking in this code, to say the least)

I suspect it’s actually removing and restoring the block protection during page write operations? Hard to tell without access to source code.

Regardless, it seems it’s not doing this during the initial chip erase (Chip Erase is forbidden if any of the block protection bits are set), so the chip erase operation just fails without any error detection/reporting.

If my analysis is correct, MB2 manages to flash the chip the first time around when it’s already fully erased. But the second time around it fails as the Chip Erase fails to execute.

May I suggest:

  • Actually checking the status from the Chip Erase command for better error indication/reporting.
  • Permanently turning off all the block protection bits in the QSPI during flash operations.
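To make the two suggestions concrete, here is a minimal sketch against a toy flash model. Everything in it is hypothetical: `FakeQspi` and `checked_chip_erase` are names invented for illustration, the model simply ignores a chip erase while any block protection bit is set, and none of this reflects the actual MB2 source, which is not available.

```python
class FakeQspi:
    """Toy NOR flash model: chip erase is ignored while any BP bit is set."""

    BP_MASK = 0x3C  # assumed position of the block protection bits

    def __init__(self, status=0x68, size=16):
        self.status = status
        self.data = bytearray(size)  # all 0x00, i.e. programmed, not erased

    def read_status(self):
        return self.status

    def write_status(self, value):
        self.status = value & 0xFF

    def chip_erase(self):
        # Silently does nothing when protected, mimicking the suspected behaviour.
        if self.status & self.BP_MASK == 0:
            self.data = bytearray(b"\xff" * len(self.data))

    def is_erased(self):
        return all(b == 0xFF for b in self.data)


def checked_chip_erase(dev):
    """Clear block protection, erase, then verify that the erase took effect."""
    status = dev.read_status()
    if status & dev.BP_MASK:
        dev.write_status(status & (0xFF ^ dev.BP_MASK))
    dev.chip_erase()
    if not dev.is_erased():
        raise RuntimeError("chip erase did not take effect")


dev = FakeQspi()
dev.chip_erase()  # unchecked erase on a protected device
print("unchecked erase worked:", dev.is_erased())
checked_chip_erase(dev)
print("checked erase worked:", dev.is_erased(), "status:", hex(dev.read_status()))
```

On a real part the verification step would more likely read back a status or flag register, or sample the array, rather than scan all of it.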

It’s also unclear how the block protection bits got set in the first place. According to the QSPI datasheet they should be off from the factory. Perhaps during some kind of manufacturing test?

Hi,

Thanks for sharing your observation!
We will look into it.

It looks like initrd flashing worked. Did that have any impact on the mb2 flashing? For example, if you retried the mb2 flashing immediately afterward did it work? If mb2 flashing worked once, does it now always work or does it break again after you have flashed once with mb2?

I’m trying to understand why, for example, the Dev Kit always works fine but your new boards do not. At first I thought it might be related to which QSPI part is on the module. I keep a bunch of my old logs, and I can see that many of the boards I have tested show the MX25U51279G part number in the logs. I have never seen this issue.

Here are my observations.

It doesn’t matter if I use my own carrier board or the devkit’s carrier board.

The first time you flash a module it always works (regardless of whether you flash it from mb2 or Linux) because it’s already erased.

I suspect, but can’t confirm, that the flash page write code in mb2 unlocks the block-protection bits (but it looks like it restores them after the page write?).

The chip-erase code in mb2 is buggy because it a) doesn’t check if the chip-erase operation fails, b) doesn’t turn off the block protection bits.

As soon as Linux boots, it will turn off the block protection bits, regardless of whether you flash the QSPI chip or not. Note that this only happens if you boot using the bootrcm method since, for some unclear reason, Nvidia has decided that it’s “unsafe” / “unsecure” to expose the QSPI to Linux when booting from flash.

I suspect that the modules shipped with the devkit have been flashed via the bootrcm / initrd method, so they’re fine. But the history of the devkit modules is (obviously) hard for me to speculate about.

For example with a pristine module:

  • Flash via mb2 (OK)
  • Boot into Linux from flash (coldboot) (Linux can’t see the QSPI chip in this case, so it does nothing with it)
  • Force into recovery mode
  • Flash again from mb2 (chip-erase fails, flashing fails)

or

  • Flash via mb2 (OK)
  • Boot into Linux from bootrcm (Linux will see the QSPI chip, complain about the block protection bits and turn them off)
  • Force into recovery mode
  • Flash again from mb2 (works fine)
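The two sequences above can be replayed against a very small state model of the hypothesis. This is purely illustrative and encodes the assumptions from this thread rather than known MB2 behaviour: a pristine module is erased but block-protected, the MB2 chip erase only takes effect when protection is off, page writes succeed on erased flash and leave the protection state unchanged, and only an rcmboot of Linux clears protection. The function names are made up for the sketch.

```python
def new_module():
    # Assumed pristine state: flash is blank, block protection bits are set.
    return {"erased": True, "protected": True}


def mb2_flash(m):
    """Hypothesised MB2 flash pass; returns "OK" or "FAIL"."""
    if not m["protected"]:
        m["erased"] = True   # chip erase only takes effect when unprotected
    if not m["erased"]:
        return "FAIL"        # writing over stale data leaves an unbootable image
    m["erased"] = False      # page writes succeed; protection state is left as it was
    return "OK"


def boot_linux(m, via_rcm):
    # Only an rcmboot exposes the QSPI to Linux, which then clears protection.
    if via_rcm:
        m["protected"] = False


def replay(via_rcm):
    m = new_module()
    first = mb2_flash(m)
    boot_linux(m, via_rcm)
    second = mb2_flash(m)    # after forcing recovery mode again
    return [first, second]


print("coldboot in between:", replay(via_rcm=False))
print("rcmboot in between: ", replay(via_rcm=True))
```

If the model is right, a module that has been through one rcmboot keeps working afterwards, since nothing in the model sets the protection bits again.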

In the second sequence you mentioned, where you used rcmboot to Linux: if you power cycle the board and attempt to flash with mb2, is it back in a failing state? I’m trying to understand why we don’t see this same issue.

In your very first post you made this statement:

“The tegraflash tooling fails to erase QSPI on all P3767 modules I’m testing with except for the ones that came with the Jetson Orin Nano devkit.”

I was also trying to understand what’s different about that module. Is it using a different QSPI part?

As far as I can tell the part shipped on the devkit is the same. (But tbh, this is something that you at Nvidia should have a better knowledge of than I do, no?).

Also, I’ve already flashed all the modules I’ve got, so I can’t really verify the power cycle behavior, but I’m 99% sure I’ve power cycled the modules when taking them into recovery mode. (My custom carrier board does not have a reset, so a full power cycle is required to restart the module into recovery mode.)

At the moment I can work around the bug since I know what the problem is.

