Dear Nvidia Team,
During Burn-In testing of Orin NX devices with our custom carrier board, we see a high failure rate of 15% to 50%, depending on the run. For this Burn-In test, we boot the devices repeatedly over multiple temperature cycles from -25°C to 65°C.
We see that most failures happen at very low temperatures, around -15°C and lower. I attached an RS-232 cable to the UART output of one device, and it seems that the device boots into the UEFI shell and doesn’t try to boot from the attached NVMe at all.
Please find attached the output of a successful boot here:
good_boot.txt (89.0 KB)
and the output of an unsuccessful boot here:
bad_boot.txt (33.5 KB)
Both logs are from the same device, just at different temperatures.
As NVMe we use an Apacer PV920 B92.925JHV.00211 with extended temperature range, rated for -40°C. On the baseboard, all components are rated for -40°C as well.
The Orin NX is running JetPack 5.1.2 with all overlays for L4T 35.4.1 installed.
Do you have an idea what could cause this problem or an input on how to debug this further?
Thank you very much in advance and best regards,
Olivier
Just as an update:
We compiled the UEFI in debug mode to get more output.
Please find the output at room temp here:
good_boot_debug.txt (162.2 KB)
And the output of a failed boot at -20°C here:
bad_boot_debug.txt.txt (102.3 KB)
I noticed that during the unsuccessful boot there is a “PCIe Controller-4 Link is DOWN” entry on line 1521, which is not there during the successful boot. The NVMe is connected to Controller 4.
During testing, I was able to start the device at -20°C when performing a soft-reboot from the boot menu, whereas the previous unsuccessful boots were all hard reboots.
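For anyone comparing their own UART captures, here is a minimal Python sketch of how the logs can be checked for the link-down entry mentioned above. The marker string is the one quoted from the failing log; the function names are made up for illustration, and the exact wording of the message may differ between UEFI builds.

```python
from pathlib import Path

# Marker text as quoted in this thread; adjust it if your UEFI build
# words the message differently (assumption).
LINK_DOWN_MARKER = "PCIe Controller-4 Link is DOWN"


def find_marker_lines(text, marker=LINK_DOWN_MARKER):
    """Return the 1-based line numbers of every line containing `marker`."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if marker in line
    ]


def scan_logs(paths, marker=LINK_DOWN_MARKER):
    """Map each log file path to the line numbers where `marker` appears."""
    results = {}
    for path in paths:
        # UART captures often contain stray bytes, so decode leniently.
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        results[str(path)] = find_marker_lines(text, marker)
    return results
```

Calling scan_logs(["good_boot_debug.txt", "bad_boot_debug.txt.txt"]) returns an empty list for a log without the entry and the matching line numbers otherwise, which makes it easy to sort a whole burn-in run into "link came up" and "link stayed down".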
Hi,
I think the issue here is that the NVMe itself does not seem to be working. Could you boot from a USB drive first and check whether the NVMe is detected in the kernel?
Dear Wayne,
Thank you for the suggestion.
Please find the log from a boot over USB here:
usb_boot_debug.txt (221.3 KB)
When booting from an attached USB device, it seems like the NVMe is not recognized by the UEFI, again indicated by the “PCIe Controller-4 Link is DOWN” entry on line 1522. However, the kernel itself recognizes the NVMe (see line 2598).
We’re now looking into changing the PCIe controller startup order or adding a delay to the UEFI before checking the PCIe links in the hope that the increased time helps.
If you have any further suggestions or insights, please feel free to share them.
Best regards,
Olivier
Could you also test another kind of NVMe? And please test on the NVIDIA devkit too.
Hi Wayne,
We’ve noticed that replacing the NVMe with another one of the same type usually resolves the issue. We’re planning to conduct further testing on a devkit using a “known bad” NVMe, and we’ll share our findings here.
Best regards,
Olivier
Dear Wayne,
We’ve encountered the same issue on a devkit using a “known bad” NVMe. Below are the logs:
At room temperature (working):
log_devkit_good.txt (79.9 KB)
At -25°C (failing to boot):
log_devkit_bad.txt (31.0 KB)
Do you have any further suggestions on what we could try?
Thank you and best regards,
Olivier
Your log indicates you are not using the latest version of JetPack. Could you move to the latest one?
Dear Wayne,
I flashed the device from my previous reply with JetPack 5.1.3 and redid the test. It shows the same issue.
At room temperature (working):
log_devkit_513_good.txt (85.7 KB)
At -25°C (failing to boot):
log_devkit_513_bad.txt.txt (35.1 KB)
Do you have any further suggestions?
Thank you and best regards,
Olivier
Hi,
If this only happens with a specific NVMe at low temperature, then we may need some further checks. For now, this does not sound like an NVIDIA-related error.
Could you share which NVMe the known bad one is?
Hi Wayne,
The NVMe in question is an Apacer B92.925JHV.00211 (PV920-M280 series with extended temperature range, 240GB)
Best regards,
Olivier
Hi Wayne,
We’re currently working on a workaround to address the situation where the device fails to detect the NVMe drive during startup. Specifically, we want the device to automatically reset itself when it boots into the UEFI shell.
In a previous post I mentioned that the devices are able to start from the NVMe when soft-rebooting.
We were able to rebuild the esp.img and the startup.nsh now triggers a reset when we manually select the Shell from the boot menu by pressing F11 during boot.
However, on an unattended device where the NVMe is not recognized and the device launches into the Shell automatically, the startup.nsh script is not found.
Do you have any insights on how and where we should implement this startup.nsh script to ensure it is executed in such scenarios? Or could there be multiple locations where we would need to flash the esp.img to achieve this?
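For reference, the kind of startup.nsh described above can be sketched as follows. This is only an illustration, not our exact script: the helper names, the 3-second delay and the choice of reset type are assumptions. It uses the standard UEFI Shell commands echo, stall (time in microseconds) and reset (cold reset by default, -w for a warm reset). It does not answer where the file has to live so that the automatically launched Shell finds it.

```python
from pathlib import Path


def build_startup_nsh(stall_us=3_000_000, reset_flag=""):
    """Return the text of a startup.nsh that waits and then resets.

    stall_us   -- delay before the reset, in microseconds (assumed value)
    reset_flag -- "" for the Shell's default (cold) reset, "-w" for warm
    """
    reset_cmd = f"reset {reset_flag}".strip()
    lines = [
        "echo -off",
        "echo No boot device found, resetting",
        f"stall {stall_us}",
        reset_cmd,
    ]
    # CRLF line endings are customary for UEFI Shell scripts.
    return "\r\n".join(lines) + "\r\n"


def write_startup_nsh(directory, **kwargs):
    """Write startup.nsh into `directory` (e.g. a mounted ESP) and return its path."""
    target = Path(directory) / "startup.nsh"
    target.write_bytes(build_startup_nsh(**kwargs).encode("ascii"))
    return target
```

Since soft reboots succeeded at -20°C while hard reboots did not, it may be worth comparing reset_flag="" against reset_flag="-w" on a cold device.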
Thank you and best regards,
Olivier
Do you mean that the script is stored on the NVMe, so it cannot be found when the NVMe is not detected?
Hi Wayne,
I am asking if there is a way to flash this script to an onboard storage, e.g. the QSPI.
Sorry, what script are you talking about? I am not sure what you are doing here.
Hi Wayne,
Sorry for the confusion.
When my device does not recognize the NVMe, it automatically launches the UEFI shell. The UEFI shell says it will automatically launch the startup.nsh script. This is all default behaviour so far. And by default, there is no startup.nsh script on the device.
I am now trying to somehow get a startup.nsh script onto the onboard storage of the device. Is there a way to do this?
Thank you and best regards,
Olivier
Could you share a screenshot of what you see on your side?
Hi Wayne,
The following Shell appears when the NVMe is not recognized:
Hi,
Just to clarify: do you want to reboot the device when you hit the shell shown above, or do you still want to boot from other media?
Hi Wayne,
Yes, exactly. I want to reboot automatically when we hit the Shell.
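As a stopgap for the burn-in rig only (it does not help an unattended device in the field), the reset can also be triggered from the host that is already attached to the UART. Below is a minimal Python sketch under these assumptions: the port object is anything with read() and write() methods, for example a serial port opened with a third-party serial library using a read timeout; the prompt text "Shell>" and the function name are illustrative.

```python
def reset_on_shell_prompt(port, prompt=b"Shell>", command=b"reset\r\n",
                          max_bytes=1_000_000):
    """Read from `port` until `prompt` appears, then send `command`.

    Returns True if the command was sent, and False if the stream ended
    (or a read timed out) or `max_bytes` were read without the prompt.
    """
    window = b""
    seen = 0
    while seen < max_bytes:
        chunk = port.read(1)
        if not chunk:  # read timeout or end of stream
            return False
        seen += len(chunk)
        # Keep only the tail needed to match the prompt across reads.
        window = (window + chunk)[-len(prompt):]
        if window == prompt:
            port.write(command)
            return True
    return False
```

On the test bench this would be called once per power cycle; a return value of True means the device dropped into the Shell and was asked to reset, which is also a convenient way to count failed boots.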