Corrupted QSPI Support

Hi Team,

Hope you are well! I wanted to ask if anyone could support with the below:

I have been contacted with a problem one customer is encountering with their NVIDIA Jetson Orin NX 8GB.

They have a case in which their Jetson Orin NX board is not booting and the only way to restore it has been by connecting to USB-C and flashing the QSPI memory using an Ubuntu Linux Host.
They do not know the reason (they are investigating) but want to anticipate what could happen in the field and it seems safer to have a way to restore remotely.
On their board they have an MPU running Linux, and it has free USB ports. The problem is that flashing the QSPI appears to require a host running Ubuntu Linux, and they don’t know whether it is possible to flash it from another basic Linux distribution running on the MPU included on their own board.

Have you ever encountered this problem? What happens if the QSPI is corrupted and the system is not booting? Is there any remote way to update it if this happens?

This is a good opportunity we want to win, any further info needed please let me know!

Many thanks!

Kind regards,
Edward

I won’t be able to answer this completely, but there is some information you’ll likely need in order to get an answer.

  • What exact model? If it is a dev kit, then answers are different than if this is a third party carrier board (different firmware is involved, plus the module itself differs on a dev kit versus a commercial module)
  • Note that models with an SD card on the carrier board are always third party carrier boards; models with an SD card on the module itself are always dev kits. Carrier boards and SD card placement (if those are involved) change the firmware. Updating firmware, or software which depends on firmware, changes the answers.
  • What do you see from:
    head -n 1 /etc/nv_tegra_release
  • What is the content of “/etc/nv_boot_control.conf”?
  • If possible, can you attach a copy of “/boot/extlinux/extlinux.conf”?
  • What is the output of “cat /proc/cmdline”?
  • Is this unit using only the internal eMMC?
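
The items above could be gathered in one pass with something like the following. This is only a sketch: the function name and the optional ROOT prefix are made up for illustration (the prefix lets you run it against a copy of the filesystem instead of the live system), and it only reads the files already named in the list.

```shell
#!/bin/sh
# Print the files requested above as one report.
# $1 is a hypothetical prefix (for example a mounted clone of the rootfs);
# leave it empty when running on the Jetson itself.
collect_jetson_info() {
    root="${1:-}"
    for f in /etc/nv_tegra_release /etc/nv_boot_control.conf \
             /boot/extlinux/extlinux.conf /proc/cmdline; do
        echo "===== $f ====="
        if [ -r "$root$f" ]; then
            if [ "$f" = /etc/nv_tegra_release ]; then
                # Only the first line was asked for.
                head -n 1 "$root$f"
            else
                cat "$root$f"
            fi
        else
            echo "(not readable)"
        fi
    done
}
```

Running `collect_jetson_info > jetson_info.txt` on the unit gives a single file which can be attached to a post.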

That information should help someone give a better answer. Note that QSPI is used differently on different models, and on different release versions.

Hi LinuxDev,

Thanks so much for your swift reply!
I have gone back to the customer and team and it seems there is a bit of a misunderstanding.

The customer is asking if there is some way to re-program the corrupted QSPI of the NVIDIA Jetson Orin NX 8GB SoM using an ARM Processor with a simple Linux running on it.
If this is not possible, is there another way to reprogram the QSPI without physically accessing the board? (For example, the board is in a remote place and the QSPI has been corrupted.)
Is this possible at all?

Customer is using their own custom carrier board with an external SSD. Not using a Development kit.

Many thanks!
Kind regards,
Edward

The QSPI must be accessed in recovery mode. This turns the Jetson into a custom USB device, and thus it requires a USB cable to a computer with the custom driver for that device (a recovery mode Jetson is not a standard mass/bulk storage device). The driver itself is written for Linux and is a binary executable compiled only to run on a desktop PC architecture. Thus the executable cannot run natively in an ARM environment. I suppose if you ran it in a VM on the ARM system there is a possibility, but this is more complicated than it sounds, and more or less implies installing an entire Ubuntu system in that VM.
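
As a rough way to check both conditions on a candidate host, here is a small sketch. The function names are made up for illustration. The x86_64 check is my assumption of what "desktop PC architecture" means here. The vendor ID 0955 is NVIDIA's USB vendor ID; the product ID differs between modules, so only the vendor is matched, and any other NVIDIA USB device would match too.

```shell
#!/bin/sh
# Succeeds if the given machine name looks like a desktop PC host.
# Assumption: the flash binaries target x86_64 only.
is_supported_flash_host() {
    [ "$1" = "x86_64" ]
}

# Reads `lsusb`-style lines on stdin and prints those carrying
# NVIDIA's USB vendor ID (0955). Exit status is grep's: non-zero if none.
find_nvidia_usb() {
    grep 'ID 0955:'
}
```

Typical use would be `is_supported_flash_host "$(uname -m)" && lsusb | find_nvidia_usb`. An empty result with the Jetson cabled up usually means it is not actually in recovery mode.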

One can command line flash just QSPI though. I have not tried to do that on Orin yet, and there might be some details to follow if it is L4T R36.x that differ from L4T R35.x.

The safest thing is to clone the rootfs prior to flashing, and then to “reuse” the rootfs image. During a normal flash both the QSPI and everything else are flashed. If you reuse the rootfs image with the same release of the L4T flash software, then in theory you are updating just the QSPI.

However, using an external SSD complicates things. Sometimes part of boot is on a rootfs on the eMMC. One would clone both any rootfs on the eMMC and clone the SSD. The eMMC might be doing nothing more than a chroot to the SSD. There are a lot of variables though, e.g., an initrd changes things, and so does the a/b partition scheme for backup partitions.

There is a backup and restore script in the tools. I suspect that is the most thorough method of doing this, but I don’t know the details of where your initrd is stored, or whether the device tree is stored in eMMC partitions or in /boot. If it is in /boot, I also don’t know which /boot that is:

  • initrd
  • eMMC /boot
  • SSD /boot

However, there is one key to all of this: You need to back up the SSD for the actual content. This is easily done with either the backup and restore tool script, or with dd on another computer. The dd method works on ARM devices. Cloning requires recovery mode. There might also be options for rsync on a running system.
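
A minimal sketch of the dd approach follows, assuming the SSD has been moved to another Linux machine and is not mounted. The function name is made up, and the device node in the usage note is a placeholder, not something from this thread.

```shell
#!/bin/sh
# Copy a whole disk (or any file) to an image, then verify the copy.
# $1 = source device or file, $2 = destination image file.
clone_disk() {
    src="$1"
    dest="$2"
    [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }
    # 4 MiB blocks, written as a plain number so it is accepted by
    # both GNU and BSD dd.
    dd if="$src" of="$dest" bs=4194304 || return 1
    sync
    # Byte-for-byte comparison; non-zero exit status means the copy differs.
    cmp "$src" "$dest"
}
```

For a real disk this needs root, for example `clone_disk /dev/sdX ssd_backup.img` run under sudo, where /dev/sdX is whatever node the SSD shows up as on that machine. The image is as large as the whole disk, so check free space first.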

If this were me, then I would probably prefer to start by putting the SSD on another computer and cloning it with dd. This would even copy any UID/GID. Secondarily, I would also create a clone backup in recovery mode. Then I would create a serial boot log (after removing any “quiet” in “/boot/extlinux/extlinux.conf”) and keep that for information on where any device tree is being pulled from, and the exact name of those device trees (this could be critical to putting a system back together manually if flash scripts fail).
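
For the “quiet” removal, something like this sketch would do. The function name is made up; it prints the edited file to stdout instead of changing it in place, so you can review the result before copying it over the original.

```shell
#!/bin/sh
# Print an extlinux.conf with a standalone "quiet" token removed from
# APPEND lines. Other lines (including comments) are left alone.
# $1 = path to the config, default /boot/extlinux/extlinux.conf.
strip_quiet() {
    conf="${1:-/boot/extlinux/extlinux.conf}"
    sed -E '/^[[:space:]]*APPEND[[:space:]]/ s/[[:space:]]+quiet([[:space:]]|$)/\1/g' "$conf"
}
```

One way to use it is `strip_quiet > /tmp/extlinux.conf.verbose`, then compare that against the original with diff and keep a backup of the original before replacing it.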

Please note that it is important to pay special attention to any device tree because parts of the device tree may be in different locations, and the custom carrier board will need modifications for any part of the carrier board which is not an exact layout match to the reference carrier board. A fully verbose full serial console boot log could be useful before you ever decide how to start the backup or restore.

I’m hopping in here, even though my issue is with the Orin Nano. Just yesterday, we had 2 separate similar incidents. Custom carrier board, has worked fine for weeks on both units. Orin Nano 4 GB modules. Both units fired up as usual before a demonstration. They were powered down. One unit failed to boot on the next power up attempt. And then about an hour later, the other unit failed to boot when re-powered. Reflashed the QSPI on both units, didn’t touch anything else (a different NVMe was used for the flash), and both units are back to where they should be. Our only guess at this stage is a USB-C dongle that seems to draw more current than it should. Regardless, it’s very concerning that 2 QSPIs could become corrupted in such a short period of time, even with the suspect dongle (or other factors). Love to hear if anyone else is having these issues, especially if it points towards iffy QSPI ICs.

Knowing what is going on will be nearly impossible without a serial console boot log. It could have been something as simple as an automatic update with something incompatible. Maybe it was not shut down correctly (unclean shutdown) and the filesystem is damaged. If there is something incorrect about device tree, then the module won’t know the correct pins for given functions and might use the wrong pins (which could cause more power requirement and damage). This is unlikely to be a QSPI IC, but if it is, then you need a serial console boot log.

If you can clone, then you know the hardware is working. This would also give you access to logs and a chance to examine everything on the rootfs (such as package manager updates).

Not trying to be argumentative, just really want to understand the mechanics of what could be happening here.

There’s no boot log, because the bootloader didn’t work. As far as I can tell, it never ran any bootloader code, at least not enough to enable the hdmi output.

The unit was not attached to either Wifi or Ethernet, so does that rule out automatic update?

Unclean shutdown is a possibility. But AFAIK the NVME drive was perfectly fine. Here’s my troubleshoot:

  • Let’s say the original system is comprised of “Carrier Board A”, “NVME drive A”, “Orin Nano Module A”.

When the issue occurred (non-responsive upon power application), I took Orin Nano Module A and put it in Carrier Board B with NVME drive B attached. I reflashed Nano Module A in that configuration and put it back in Carrier Board A with NVME drive A (ie: the original Carrier Board and NVME drives were untouched by the re-flash). Worked perfectly.

Is there any way that can’t be the QSPI?
thx

Knowing the mechanics won’t happen without the log. Serial console logs include boot stage content before the Linux kernel ever loads. If serial console is truly not putting anything out, even in boot stages, then either it is hardware failure or it needs to be flashed again. The bootloader itself is at the tail of the boot content. There is content prior to that.
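
A rough sketch of capturing that log on a Linux host is below. Everything in it is an assumption on my part and not NVIDIA's procedure: the function names are made up, the device node is whatever your USB serial adapter appears as, 115200 8N1 is the setting commonly used for the Jetson serial console, and `stty -F` is the GNU/Linux form of the command.

```shell
#!/bin/sh
# Prefix every line read on stdin with the host's wall-clock time.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$line"
    done
}

# Configure a serial device and append everything it emits to a log file.
# $1 = serial device node (placeholder), $2 = log file.
capture_console() {
    dev="$1"
    log="$2"
    # 115200 baud, 8 data bits, 1 stop bit, no parity, raw mode.
    stty -F "$dev" 115200 cs8 -cstopb -parenb raw -echo || return 1
    stamp_lines < "$dev" | tee -a "$log"
}
```

Start `capture_console /dev/ttyUSB0 boot.log` (the device name is a placeholder) before applying power to the Jetson, so the earliest boot stages are recorded. A terminal program with a capture option works just as well.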

Jetsons do not have a BIOS. What they have is the equivalent in software, and this too is part of serial console (although the boot software has “quiet” versions and can have a “verbose” version flashed).

Yes, no WiFi or Ethernet does answer that question.

So far as the swapping of carrier boards goes, were they both flashed to use NVMe? Incidentally, I would think that part of that test is a valid step, but if the rootfs is designated by a partition UUID, and if the other NVMe has a different UUID for the partition, then it will still fail to find the rootfs. Serial console log would say something if it is functioning.
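
To see how a given unit designates its rootfs, the `root=` argument can be pulled out of the kernel command line. A small sketch (the function name is made up):

```shell
#!/bin/sh
# Print the value of the last root= argument on a kernel command line.
# $1 = file holding the command line, default /proc/cmdline.
root_spec() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^root=//p' | tail -n 1
}
```

If this prints something starting with `PARTUUID=` or `UUID=`, compare it against `lsblk -o NAME,PARTUUID,UUID` with the NVMe you intend to boot from attached. A plain device path such as `/dev/nvme0n1p1` does not have this mismatch problem.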

If the original module is not flashed to use NVMe, then probably this won’t say anything. If the module appears to not turn on for one carrier board, but does turn on for the other, then that is a valid test of power delivery, but not of software.

Your test is a good test, but you still need a serial console boot log from when it was failing. This could indeed be QSPI, but there are also non-rootfs partitions which take part in boot. Sometimes the layout is optional, e.g., some content, including device tree, can be loaded either from a partition or from rootfs (not QSPI). If the path and spec for device tree exists in extlinux.conf and points to the rootfs, then that is loaded; if this is missing, then the partition is loaded. If security fuses are burned, then only the partition version is loaded, and any extlinux.conf or /boot content is ignored. Binary partitions are always signed, and those too will fail if not signed (consider that no security fuse burned still signs, but it is a NULL signature). It is really really difficult to pin anything down without that serial log from when it is failing. Yes, it might be QSPI. It might not.
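
A quick way to see which of those cases applies on a unit without burned fuses is to check what extlinux.conf actually names. This is a sketch with a made-up function name; it only reports what the file says and cannot tell you whether the fuses are burned.

```shell
#!/bin/sh
# List the kernel, initrd and device tree entries in an extlinux.conf,
# and say so when no (uncommented) FDT entry is present.
# $1 = path to the config, default /boot/extlinux/extlinux.conf.
boot_sources() {
    conf="${1:-/boot/extlinux/extlinux.conf}"
    grep -E '^[[:space:]]*(LINUX|INITRD|FDT)[[:space:]]' "$conf" || true
    if ! grep -Eq '^[[:space:]]*FDT[[:space:]]' "$conf"; then
        echo "no FDT entry: device tree is expected to come from a partition"
    fi
}
```

Lines starting with `#` are treated as comments and ignored, so a commented-out FDT line counts as missing.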

Add to this that the initrd has multiple ways to load, and one can have otherwise exactly matching binary partitions and QSPI, but one might fail while the other does not if the initrd differs. One always uses an initrd for external media boot (meaning rootfs on external media; simply having external media as secondary storage does not complicate anything). A serial console log would show the start of initrd load, and if it is in a binary partition versus on /boot of the eMMC (often, if you use external media for boot, there is still content in the /boot of eMMC due to chain loading). The log would answer this, but we don’t have the serial console log.

I will tell you something that is quite useful if you are in charge of creating and managing and supporting a Jetson. You should keep a reference serial console boot log to study for each configuration. For example, you’ve just finished flashing a dozen Orin NXs in the same way. Save a serial console boot log of one. Then if something goes wrong, you can compare to the serial console boot log of the failed unit.
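
The comparison itself can be as simple as a diff, as long as the timestamps at the start of each line are stripped first (otherwise every line differs). A sketch with made-up function names; the timestamp pattern is an assumption about the usual `[   1.234567]` style prefix and may need adjusting for your logs.

```shell
#!/bin/sh
# Strip carriage returns and a leading "[   1.234567] " style timestamp.
normalize_log() {
    tr -d '\r' < "$1" | sed -E 's/^\[ *[0-9]+\.[0-9]+\] ?//'
}

# Unified diff of two boot logs after normalization.
# $1 = reference log, $2 = log from the failing unit.
# Exit status follows diff: 0 same, 1 different.
diff_boot_logs() {
    ref_tmp=$(mktemp) || return 2
    bad_tmp=$(mktemp) || return 2
    normalize_log "$1" > "$ref_tmp"
    normalize_log "$2" > "$bad_tmp"
    status=0
    diff -u "$ref_tmp" "$bad_tmp" || status=$?
    rm -f "$ref_tmp" "$bad_tmp"
    return "$status"
}
```

With `diff_boot_logs reference.log failed.log`, the first line that differs often marks the boot stage worth looking at.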

It might be QSPI. It might not. Your flash has updated:

  • QSPI
  • Binary partitions
  • initrd

Awesome info! thanks. I’ll dive into getting reliable serial console boot log records. Question about your last point. On what physical memory are the Binary partitions and initrd stored? Are they QSPI or NVME? or dedicated memory elsewhere on the SOC? Many thanks,

For eMMC format modules the binary partitions are normally in the eMMC. For modules without eMMC (dev kits), this tends to be in the QSPI. There might be some changes in JetPack 6.x, which I have not yet looked at. If you save a flash log, then this will actually tell you where everything has been stored.