Orin NX Fails to boot from NVMe at low temperature

olivier.lutz · March 22, 2024, 1:19pm

Dear Nvidia Team,

During Burn-In testing of Orin NX devices with our custom carrier board we see a high failure rate of 15% to 50%, depending on the run. For this Burn-In test, we boot the devices repeatedly over multiple temperature cycles from -25°C to 65°C.
We see that most failures happen at very low temperatures, around -15°C and lower. I have attached an RS-232 cable to the UART output of one device and it seems that the device boots into the UEFI shell and doesn’t try to boot from the attached NVMe at all.
Please find attached the output of a successful boot here:
good_boot.txt (89.0 KB)
and the output of an unsuccessful boot here:
bad_boot.txt (33.5 KB)
Both logs are from the same device, just at different temperatures.

The NVMe we use is an Apacer PV920 B92.925JHV.00211 with extended temperature range, rated for -40°C. On the baseboard, all components are rated for -40°C as well.
The Orin NX is running JetPack 5.1.2 with all overlays for L4T 35.4.1 installed.

Do you have any idea what could cause this problem, or any suggestions on how to debug this further?

Thank you very much in advance and best regards,
Olivier

olivier.lutz · March 22, 2024, 3:10pm

Just as an update:

We compiled the UEFI in debug mode to get more output.
Please find the output at room temp here:
good_boot_debug.txt (162.2 KB)
And the output of a failed boot at -20°C here:
bad_boot_debug.txt.txt (102.3 KB)

I noticed that during the unsuccessful boot there is a "PCIe Controller-4 Link is DOWN" entry on line 1521, which is not there during the successful boot. The NVMe is connected to Controller 4.

During testing, I was able to start the device at -20°C when performing a soft-reboot from the boot menu, whereas the previous unsuccessful boots were all hard reboots.
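
For anyone comparing similar good/bad UART logs, the check described above can be scripted. The following is a minimal sketch, not a tool used in this thread: it assumes the log contains the substring "Link is DOWN" as quoted above, and the function names and sample lines are hypothetical. Note that if the real log lines carry timestamps, identical messages may not compare equal and the marker match alone would have to be used.

```python
from typing import Iterable, List, Tuple

# Substring taken from the post above; the full line format is an assumption.
MARKER = "Link is DOWN"


def find_link_down(lines: Iterable[str], marker: str = MARKER) -> List[Tuple[int, str]]:
    """Return (1-based line number, stripped line) for every line containing the marker."""
    return [(n, line.strip()) for n, line in enumerate(lines, start=1) if marker in line]


def only_in_bad(good_lines: Iterable[str], bad_lines: Iterable[str],
                marker: str = MARKER) -> List[Tuple[int, str]]:
    """Return marker lines from the bad log whose text never appears in the good log."""
    good_msgs = {msg for _, msg in find_link_down(good_lines, marker)}
    return [(n, msg) for n, msg in find_link_down(bad_lines, marker) if msg not in good_msgs]


if __name__ == "__main__":
    # Hypothetical sample lines, not copied from the attached logs.
    good = ["boot start\n", "PCIe Controller-4 Link is UP\n"]
    bad = ["boot start\n", "PCIe Controller-4 Link is DOWN\n"]
    for number, message in only_in_bad(good, bad):
        print(f"line {number}: {message}")
```

To run it against real captures, read each file with `open(path, errors="replace")` and pass the file objects in place of the sample lists.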

WayneWWW · March 25, 2024, 3:38am

Hi,

I think the issue here is that the NVMe itself does not seem to be working. Could you boot from a USB drive first and check whether the NVMe is detected in the kernel?

olivier.lutz · March 25, 2024, 8:08am

Dear Wayne,

Thank you for the suggestion.
Please find the log from a boot over USB here:
usb_boot_debug.txt (221.3 KB)

When booting from an attached USB device, it seems like the NVMe is not recognized by the UEFI, again indicated by the "PCIe Controller-4 Link is DOWN" entry on line 1522. However, the kernel itself recognizes the NVMe (see line 2598).

We’re now looking into changing the PCIe controller startup order or adding a delay to the UEFI before checking the PCIe links in the hope that the increased time helps.

If you have any further suggestions or insights, please feel free to share them.

Best regards,
Olivier
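
The "add a delay before checking the PCIe links" idea mentioned above amounts to polling the link state until a deadline instead of sampling it once. The following is a conceptual sketch only, written in Python rather than as UEFI code: `wait_for_link`, `link_is_up`, and the timeout values are hypothetical names and numbers, not part of the NVIDIA UEFI sources.

```python
import time
from typing import Callable


def wait_for_link(link_is_up: Callable[[], bool],
                  timeout_s: float = 2.0,
                  interval_s: float = 0.1,
                  sleep: Callable[[float], None] = time.sleep,
                  clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll link_is_up() until it returns True or timeout_s has elapsed.

    Returns True as soon as the link is reported up, False on timeout.
    The probe is always called at least once, even with timeout_s == 0.
    """
    deadline = clock() + timeout_s
    while True:
        if link_is_up():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical probe: the link comes up on the third poll,
    # standing in for a drive that trains slowly when cold.
    calls = {"n": 0}

    def slow_link() -> bool:
        calls["n"] += 1
        return calls["n"] >= 3

    print(wait_for_link(slow_link, timeout_s=1.0, interval_s=0.01))
```

Compared with a single fixed delay, polling costs no extra boot time on drives that come up quickly and only waits longer on the slow ones.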

WayneWWW · March 25, 2024, 8:09am

Could you also test another kind of NVMe? And please test on the NVIDIA devkit too.

olivier.lutz · April 4, 2024, 8:51am

Hi Wayne,

We’ve noticed that replacing the NVMe with another one of the same type usually resolves the issue. We’re planning to conduct further testing on a devkit using a “known bad” NVMe, and we’ll share our findings here.

Best regards,
Olivier

olivier.lutz · April 5, 2024, 9:00am

Dear Wayne,

We’ve encountered the same issue on a devkit using a “known bad” NVMe. Below are the logs:

At room temperature (working):
log_devkit_good.txt (79.9 KB)

At -25°C (failing to boot):
log_devkit_bad.txt (31.0 KB)

Do you have any further suggestions on what we could try?

Thank you and best regards,
Olivier

WayneWWW · April 5, 2024, 9:35am

Your log indicates you are not using the latest version of JetPack. Could you move to the latest one?

olivier.lutz · April 5, 2024, 1:02pm

Dear Wayne,

I flashed the device from my previous reply with JetPack 5.1.3 and redid the test. It shows the same issue.

At room temperature (working):
log_devkit_513_good.txt (85.7 KB)

At -25°C (failing to boot):
log_devkit_513_bad.txt.txt (35.1 KB)

Do you have any further suggestions?

Thank you and best regards,
Olivier

WayneWWW · April 5, 2024, 2:51pm

Hi,

If this only happens with a specific NVMe at low temperature, then further checks may be needed. For now, this does not sound like an NVIDIA-related error.

Could you share which NVMe the known bad one is?

olivier.lutz · April 8, 2024, 7:19am

Hi Wayne,

The NVMe in question is an Apacer B92.925JHV.00211 (PV920-M280 series with extended temperature range, 240GB).

Best regards,
Olivier

olivier.lutz · April 9, 2024, 9:19am

Hi Wayne,

We’re currently working on a workaround to address the situation where the device fails to detect the NVMe drive during startup. Specifically, we want the device to automatically reset itself when it boots into the UEFI shell.
In a previous post I mentioned that the devices are able to start from the NVMe when soft-rebooting.
We were able to rebuild the esp.img and the startup.nsh now triggers a reset when we manually select the Shell from the boot menu by pressing F11 during boot.
However, on an unattended device where the NVMe is not recognized and the device launches into the Shell automatically, the startup.nsh script is not found.
Do you have any insights on how and where we should implement this startup.nsh script to ensure it is executed in such scenarios? Or could there be multiple locations where we would need to flash the esp.img to achieve this?

Thank you and best regards,
Olivier

WayneWWW · April 9, 2024, 9:22am

Do you mean the script is stored on the NVMe, so if the NVMe is not detected, the script cannot be found?

olivier.lutz · April 9, 2024, 9:31am

Hi Wayne,

I am asking if there is a way to flash this script to an onboard storage, e.g. the QSPI.

WayneWWW · April 9, 2024, 9:33am

Sorry, what script are you talking about? I am not sure what you are doing here.

olivier.lutz · April 9, 2024, 9:47am

Hi Wayne,

Sorry for the confusion.

When my device does not recognize the NVMe, it automatically launches the UEFI shell. The UEFI shell says it will automatically launch the startup.nsh script. This is all default behaviour so far. By default, there is no startup.nsh script on the device.

I am now trying to somehow get a startup.nsh script onto the onboard storage of the device. Is there a way to do this?

Thank you and best regards,
Olivier

WayneWWW · April 9, 2024, 9:56am

Could you share a screenshot of what you see on your side?

olivier.lutz · April 9, 2024, 10:06am

Hi Wayne,

The following Shell appears when the NVMe is not recognized:

WayneWWW · April 9, 2024, 11:12am

Hi,

Just to clarify: do you want to reboot the device when you hit the shell shown above? Or do you still want to boot from other media?

olivier.lutz · April 9, 2024, 11:30am

Hi Wayne,

Yes, exactly. I want to reboot automatically when we hit the Shell.
