What's the possible reason for that?
Hi,
Yes, we have reproduced the issue. In particular, over the weekend we witnessed another crash using a vanilla NVIDIA BSP 35.6.2 and a Nano devkit, with 0% code from us. As it seems NVIDIA support does not have access to the hardware, we are going to dispatch a full repro, hardware and software stack, to one of their facilities so they can witness and hopefully investigate this matter.
Usually when it crashes, it ends up with an unresolved exception in EL3. We are working on a workaround, far from ideal but at least enough to mitigate the situation.
Hi,
We have now witnessed this problem several times on two distinct Nano devkits with Orin NX modules and SSDs, so it seems confined to some of the modules. Can you not make a request to retrieve some PCN modules?
I don't understand how you can provide valuable support if you do not even have the required hardware.
Also, can you provide a way to interpret the module revision so that we can try to extract a pattern?
Thanks
Hi
Is there still an engineer tracking this issue?
We have spent a month testing.
Hi,
The issue is specific to certain modules. Please collect the modules and apply for RMA:
Jetson FAQ | NVIDIA Developer
Or contact your local distributor to swap the modules.
So you confirm that some modules are faulty and need to be exchanged? Is it possible to have the SKU IDs that might be affected? Reproducing this issue at production level is time consuming and expensive.
Thanks
Hi,
We are checking it internally and have not concluded whether it is a HW or SW issue. The repro rate is low, so we need some time to investigate. It looks to be an issue occurring on specific modules, so if you can collect the modules, please apply for the RMA process.
Hi,
We have more than a hundred modules that are affected, and possibly more that went undetected. So we really need to understand whether this problem is confined to some revisions, so that we can quarantine them by reading the EEPROM and save valuable time. We are in the process of collecting SKUs for a batch of 20 faulty Orin NX modules.
From our experiments, this seems to be connected to an external abort. Can you share what the internal investigation has found so far?
Thanks
Hi,
Here is a partial list of defective Orin NX modules:
**Serial Numbers – Orin NX**
1. SN **1423225044968** | 3C6D66F30DF6 | 699-13767-0000-303 | 161-0546-10X
2. SN **1423225045191** | 3C6D66F30CB6 | 699-13767-0000-303 | 161-0546-10X
3. SN **1423225024519** | 3C6D66F326D1 | 699-13767-0000-303 | 161-0546-10X
4. SN **1423225024307** | 3C6D66F3258B | 699-13767-0000-303 | 161-0546-10X
5. SN **1422425080696** | 3C6D66B27068 | 699-13767-0000-303 | 161-0546-10X
6. SN **1423225023868** | 3C6D66F3257D | 699-13767-0000-303 | 161-0546-10X
7. SN **1423225023950** | 3C6D66F3256F | 699-13767-0000-303 | 161-0546-10X
8. SN **1423225024651** | 3C6D66F326C7 | 699-13767-0000-303 | 161-0546-10X
9. SN **1423225044482** | 3C6D66F30C6A | 699-13767-0000-303 | 161-0546-10X
10. SN **1423225024509** | 3C6D66F326A7 | 699-13767-0000-303 | 161-0546-10X
11. SN **1423225023467** | 3C6D66F32534 | 699-13767-0000-303 | 161-0546-10X
12. SN **1422525001084** | 3C6D66B275D9 | 699-13767-0000-303 | 161-0546-10X
13. SN **1423125062779** | 3C6D66F32551 | 699-13767-0000-303 | 161-0546-10X
14. SN **1422425064278** | 3C6D66B270BA | 699-13767-0000-303 | 161-0546-10X
15. SN **1423225045161** | 3C6D66F30CA4 | 699-13767-0000-303 | 161-0546-10X
16. SN **1423225024049** | 3C6D66F32596 | 699-13767-0000-303 | 161-0546-10X
17. SN **1423125062784** | 3C6D66F3255A | 699-13767-0000-303 | 161-0546-10X
18. SN **1423225024075** | 3C6D66F3257E | 699-13767-0000-303 | 161-0546-10X
19. SN **1423225044853** | 3C6D66F30C88 | 699-13767-0000-303 | 161-0546-10X
20. SN **1423225023951** | 3C6D66F32553 | 699-13767-0000-303 | 161-0546-10X
21. SN **1422425062564** | 3C6D66B2706B | 699-13767-0000-303 | 161-0546-10X
22. SN **1422525000951** | 3C6D66B27116 | 699-13767-0000-303 | 161-0546-10X
23. SN **1422425080695** | 3C6D66B26ECE | 699-13767-0000-303 | 161-0546-10X
24. SN **1423225024094** | 3C6D66F3257C | 699-13767-0000-303 | 161-0546-10X
25. SN **1422425064681** | 3C6D66B27759 | 699-13767-0000-303 | 161-0546-10X
26. SN **1421725089318** | 3C6D6661438F | 699-13767-0000-301 | 161-0546-10X
27. SN **1422525002558** | 3C6D66B28FE6 | 699-13767-0000-303 | 161-0546-10X
28. SN **1423525028588** | 4CBB4718A877 | 699-13767-0000-303 | 161-0546-10X
29. SN **1423525027776** | 4CBB4718A432 | 699-13767-0000-303 | 161-0546-10X
30. SN **1423525028587** | 4CBB4718A75F | 699-13767-0000-303 | 161-0546-10X
31. SN **1422525002418** | 3C6D66B28E72 | 699-13767-0000-303 | 161-0546-10X
32. SN **1422525002417** | 3C6D66B28E81 | 699-13767-0000-303 | 161-0546-10X
33. SN **1421725089604** | 3C6D666145AD | 699-13767-0000-301 | 161-0546-10X
34. SN **1422425065960** | 3C6D66B29618 | 699-13767-0000-303 | 161-0546-10X
35. SN **1422425080102** | 3C6D66B28FDD | 699-13767-0000-303 | 161-0546-10X
36. SN **1422525002495** | 3C6D66B28FD8 | 699-13767-0000-303 | 161-0546-10X
37. SN **1423525030807** | 4CBB47189E06 | 699-13767-0000-303 | 161-0546-10X
38. SN **1422525003950** | 3C6D66B27545 | 699-13767-0000-303 | 161-0546-10X
39. SN **1423525029458** | 4CBB4718A49E | 699-13767-0000-303 | 161-0546-10X
40. SN **1422525004176** | 3C6D66B2752A | 699-13767-0000-303 | 161-0546-10X
41. SN **1423525029101** | 4CBB4718A4D5 | 699-13767-0000-303 | 161-0546-10X
42. SN **1423525028836** | 4CBB4718A4D8 | 699-13767-0000-303 | 161-0546-10X
43. SN **1422525004173** | 3C6D66B27543 | 699-13767-0000-303 | 161-0546-10X
44. SN **1423525027795** | 4CBB4718A4AD | 699-13767-0000-303 | 161-0546-10X
45. SN **1422525002892** | 3C6D66B28DC7 | 699-13767-0000-303 | 161-0546-10X
46. SN **1423525027802** | 4CBB4718A42B | 699-13767-0000-303 | 161-0546-10X
47. SN **1423525029241** | 4CBB4718A4F4 | 699-13767-0000-303 | 161-0546-10X
48. SN **1423525029942** | 4CBB47189C49 | 699-13767-0000-303 | 161-0546-10X
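To look for a pattern in a list like the one above, a small script can group the entries by their fields. This is only a sketch: the thread does not say what the second and fourth columns are, so the names `hexid` and `other` are placeholders, and grouping on the first five serial-number digits and the first six hex digits is an arbitrary choice for illustration, not a documented encoding.

```python
import re
from collections import Counter

# Matches lines such as:
# "1. SN **1423225044968** | 3C6D66F30DF6 | 699-13767-0000-303 | 161-0546-10X"
# The field names "hexid" and "other" are placeholders (assumption).
LINE_RE = re.compile(
    r"SN \*{0,2}(?P<sn>\d+)\*{0,2} \| (?P<hexid>[0-9A-F]{12}) "
    r"\| (?P<part>[\d-]+) \| (?P<other>\S+)"
)


def parse_modules(text):
    """Return one dict per list entry that matches the format above."""
    return [m.groupdict() for m in LINE_RE.finditer(text)]


def summarize(modules):
    """Count modules by part number, serial-number prefix, and hex-ID prefix."""
    return {
        "part": Counter(m["part"] for m in modules),
        # Prefix lengths are arbitrary choices for this sketch.
        "sn_prefix": Counter(m["sn"][:5] for m in modules),
        "hexid_prefix": Counter(m["hexid"][:6] for m in modules),
    }


# Three entries copied from the list above.
SAMPLE = """\
1. SN **1423225044968** | 3C6D66F30DF6 | 699-13767-0000-303 | 161-0546-10X
26. SN **1421725089318** | 3C6D6661438F | 699-13767-0000-301 | 161-0546-10X
28. SN **1423525028588** | 4CBB4718A877 | 699-13767-0000-303 | 161-0546-10X
"""

if __name__ == "__main__":
    for key, counts in summarize(parse_modules(SAMPLE)).items():
        print(key, dict(counts))
```

Running the same grouping over a list of known-good modules would be needed before reading anything into the counts, since a prefix that dominates the faulty list may simply dominate the whole production batch.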
Hi,
We have a customer reporting PCIe C4 failing to detect the NVMe SSD during boot, which triggers a system hang. It occurs randomly on certain modules and is not tied to specific serial numbers. The issue is not expected to be present on so many modules. So if you put the modules on a developer kit and flash r35.6.2, they cannot boot up successfully? Is there a failure rate, or does it fail to boot every time?
Hi,
We have also seen it connected with the PCIe/NVMe driver setup, but also that sometimes the memory bus triggers ECC errors before the MMU can handle them, and then a hardware interrupt is triggered, leading to an unhandled exception in EL3 within ATF.
We are trying to intercept those situations up front, write into the scratch register for the boot slot so that it reboots into the same slot, and then force a reboot. Even if we are able to detect the cases early (hopefully), the root cause remains a mystery. Obviously, we do not have access to all the internals, so you may have a better chance of tackling it.
Over the next few days/weeks we are going to stress test it to see whether this WAR (workaround) improves things.
It boots most of the time, but depending on the module a failure can occur within 60 to 6000 reboots, with an average of 100 to 150. We have tested this using our own custom board and BSP, but also using Xavier and Nano devkits with the stock sample BSP as provided by NVIDIA, so there is probably some timing issue near the limit, or other factors, that triggers this issue. On our side the module is mostly a black box.
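With failure rates this low, it helps to estimate how many reboot cycles a stress test needs before a clean run means anything. The sketch below assumes each boot fails independently with a fixed probability p = 1 / (mean boots to failure), which the thread does not establish; under that assumption the number of boots needed to see at least one failure with confidence c is n = ceil(ln(1 - c) / ln(1 - p)). The value 125 is just the midpoint of the 100 to 150 range quoted above.

```python
import math


def boots_needed(mean_boots_to_failure, confidence=0.95):
    """Boots required to observe at least one failure with the given
    confidence, assuming independent boots with p = 1 / mean (assumption)."""
    p = 1.0 / mean_boots_to_failure
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


def prob_clean_run(mean_boots_to_failure, n_boots):
    """Probability that a faulty module survives n_boots without failing,
    under the same independence assumption."""
    p = 1.0 / mean_boots_to_failure
    return (1.0 - p) ** n_boots


if __name__ == "__main__":
    # 125 is the midpoint of the reported average; 6000 is the reported worst case.
    for mean in (125, 6000):
        n = boots_needed(mean)
        print(f"mean={mean}: {n} boots, clean-run probability {prob_clean_run(mean, n):.3f}")
```

Under this model a module near the reported average needs a few hundred clean reboots before a pass is convincing, while a module at the slow end of the range needs far more, so a fixed short reboot count cannot clear every module.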
log.txt (9.4 KB)
Hi,
We tested another machine. The OS failed to boot, and it crashed twice in UEFI. Please help us check. Thank you.
Hi @smileandcry2023
The assertion looks different from the PCIe detection failure:
ASSERT [PrePi] edk2-docker/nvidia-uefi/edk2-nvidia/Silicon/NVIDIA/PrePi/PrePi.c (507)
Do you observe it on a custom board or a developer kit? Do you use r35.6.2 or r36.5? Does it occur on every boot, or is there a failure rate?
No, just one of our customized boards reported it.
Hi,
After contacting our NVIDIA representative, we are dispatching a full repro: a Nano devkit with an SSD and a faulty Orin NX, using the vanilla sample BSP 35.6.2 as provided by NVIDIA.
The full repro also automates the reboot at the right time and logs events to highlight when the issue occurs. We also wrote a document explaining how to interpret the results and how to reflash the full repro in case there is a need to validate other modules.
Hopefully, this will make it possible to find the root causes and countermeasures.
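For anyone building a similar reboot-and-log setup, the per-boot logs can be tallied by failure signature. This is a minimal sketch, not the tooling shipped with the repro: it assumes one log string per boot, the label names are made up, and the only signature included is the PrePi assertion text quoted earlier in this thread. Other signatures would have to be added from real logs.

```python
from collections import Counter

# Signature text copied from the assertion quoted earlier in this thread.
# The label name and the one-string-per-boot layout are assumptions.
SIGNATURES = {
    "uefi_prepi_assert": "ASSERT [PrePi]",
}


def classify_boot(log_text, signatures=SIGNATURES):
    """Return the label of the first signature found in one boot's log."""
    for label, needle in signatures.items():
        if needle in log_text:
            return label
    return "no_known_signature"


def tally(boot_logs, signatures=SIGNATURES):
    """Count boots per label across a list of per-boot log strings."""
    return Counter(classify_boot(text, signatures) for text in boot_logs)


if __name__ == "__main__":
    # The first entry is a made-up placeholder for a boot without a known signature.
    logs = [
        "placeholder text for a boot with no known signature",
        "ASSERT [PrePi] edk2-docker/nvidia-uefi/edk2-nvidia/Silicon/NVIDIA/PrePi/PrePi.c (507)",
    ]
    print(tally(logs))
```

A boot that lands in `no_known_signature` is not necessarily a good boot; it only means none of the listed strings appeared, so hangs without any log output still need a separate timeout check.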
Hi
I have tested the patch file “overlay_mb1bct_35.x.tbz2”, whose description is:
“This overlay fixes a boot issue caused by the QSPI read timing not having sufficient margin to cover process, voltage, and temperature variations.”
The rate of the issue decreased a lot.
The content of the patch is:
/ {
    device {
        qspiflash@0 {
            trimmer2-val = <0x04>;
        };
    };
};
What does this mean?
Can we change the value?
Hi @smileandcry2023
The overlay fixes the issue:
Jetson Orin Nano boot failure with temperature dependency
Core board fails to boot (核心板无法启动)
You may try 0x2 to see if stability improves further.
I changed the value to 0x02, but the issue “Orin nano UEFI boot screen shows L4TLauncher: Attempting Direct Boot and cannot be closed” (Orin nano UEFI开机屏幕显示L4TLauncher: Attempting Direct Boot无法关闭) happened more frequently.
Hi @smileandcry2023
Please share how to replicate the issue on a developer kit:
Orin nano UEFI boot screen shows L4TLauncher: Attempting Direct Boot and cannot be closed (Orin nano UEFI开机屏幕显示L4TLauncher: Attempting Direct Boot无法关闭) - #43 by DaneLLL
And let’s continue the discussion in that topic thread.
Hi Dane,
What about the full repro we sent to NVIDIA Taiwan, which includes the Nano devkit, a problematic Orin NX module, and detailed instructions to reproduce it?
What about the investigation into PCIe C4 failing to detect the NVMe SSD during boot and triggering system hangs, as reported by another customer?
Can any results be shared?
Thanks