Orin NX reboots repeatedly and unexpectedly switches to slot B

What is the possible reason for that?

Hi,

Yes, we have reproduced the issue. In particular, over the weekend we witnessed another crash using a vanilla Nvidia BSP 35.6.2 on a Nano devkit, with none of our own code. As it seems Nvidia support does not have access to the hardware, we are going to dispatch a full repro, both hardware and software stack, to one of their facilities so they can witness and hopefully investigate this matter.

Usually when it crashes, it ends up with an unhandled exception in EL3. We are working on a workaround, which is far from ideal but should at least mitigate the situation.

Hi,

We have now witnessed this problem several times on two distinct setups (Nano devkit, Orin NX, SSD), so it seems to be confined to some of the modules. Can you not make a request to retrieve some PCN modules?

I don’t understand how you can provide valuable support if you do not even have the required hardware.

Also, can you provide a way to interpret the revision of the module, so we can see whether a pattern emerges?

Thanks

Hi

Is there still an engineer tracking this issue?

We have spent a month testing.

Hi,
The issue is specific to certain modules. Please collect the modules and apply for RMA:
Jetson FAQ | NVIDIA Developer

Or contact your local distributor to swap the modules.

So you confirm that some modules are faulty and need to be exchanged? Is it possible to have the IDs of the SKUs that might be affected? Reproducing this issue at production level is both time-consuming and expensive.

Thanks

Hi,
We are checking it internally and have not yet concluded whether it is a HW or SW issue. The repro rate is low, so we need some time to investigate. It looks to be an issue occurring on specific modules, so if you can collect the modules, please apply for the RMA process.

Hi,

We have more than a hundred modules that are affected, and possibly more that went undetected. So we really need to understand whether this problem is confined to some revisions, so that we can quarantine them by reading the EEPROM and save valuable time. We are in the process of collecting the SKUs for a batch of 20 faulty Orin NX modules.

From our experiments, this seems to be connected to an external abort. Can you share what your internal investigation has found so far?

Thanks

Hi,

Here is a partial list of defective Orin NX modules:

**Serial Numbers – Orin NX**

 1. SN **1423225044968** | 3C6D66F30DF6 | 699-13767-0000-303 | 161-0546-10X

 2. SN **1423225045191** | 3C6D66F30CB6 | 699-13767-0000-303 | 161-0546-10X

 3. SN **1423225024519** | 3C6D66F326D1 | 699-13767-0000-303 | 161-0546-10X

 4. SN **1423225024307** | 3C6D66F3258B | 699-13767-0000-303 | 161-0546-10X

 5. SN **1422425080696** | 3C6D66B27068 | 699-13767-0000-303 | 161-0546-10X

 6. SN **1423225023868** | 3C6D66F3257D | 699-13767-0000-303 | 161-0546-10X

 7. SN **1423225023950** | 3C6D66F3256F | 699-13767-0000-303 | 161-0546-10X

 8. SN **1423225024651** | 3C6D66F326C7 | 699-13767-0000-303 | 161-0546-10X

 9. SN **1423225044482** | 3C6D66F30C6A | 699-13767-0000-303 | 161-0546-10X

10. SN **1423225024509** | 3C6D66F326A7 | 699-13767-0000-303 | 161-0546-10X

11. SN **1423225023467** | 3C6D66F32534 | 699-13767-0000-303 | 161-0546-10X

12. SN **1422525001084** | 3C6D66B275D9 | 699-13767-0000-303 | 161-0546-10X

13. SN **1423125062779** | 3C6D66F32551 | 699-13767-0000-303 | 161-0546-10X

14. SN **1422425064278** | 3C6D66B270BA | 699-13767-0000-303 | 161-0546-10X

15. SN **1423225045161** | 3C6D66F30CA4 | 699-13767-0000-303 | 161-0546-10X

16. SN **1423225024049** | 3C6D66F32596 | 699-13767-0000-303 | 161-0546-10X

17. SN **1423125062784** | 3C6D66F3255A | 699-13767-0000-303 | 161-0546-10X

18. SN **1423225024075** | 3C6D66F3257E | 699-13767-0000-303 | 161-0546-10X

19. SN **1423225044853** | 3C6D66F30C88 | 699-13767-0000-303 | 161-0546-10X

20. SN **1423225023951** | 3C6D66F32553 | 699-13767-0000-303 | 161-0546-10X

21. SN **1422425062564** | 3C6D66B2706B | 699-13767-0000-303 | 161-0546-10X

22. SN **1422525000951** | 3C6D66B27116 | 699-13767-0000-303 | 161-0546-10X

23. SN **1422425080695** | 3C6D66B26ECE | 699-13767-0000-303 | 161-0546-10X

24. SN **1423225024094** | 3C6D66F3257C | 699-13767-0000-303 | 161-0546-10X

25. SN **1422425064681** | 3C6D66B27759 | 699-13767-0000-303 | 161-0546-10X

26. SN **1421725089318** | 3C6D6661438F | 699-13767-0000-301 | 161-0546-10X

27. SN **1422525002558** | 3C6D66B28FE6 | 699-13767-0000-303 | 161-0546-10X

28. SN **1423525028588** | 4CBB4718A877 | 699-13767-0000-303 | 161-0546-10X

29. SN **1423525027776** | 4CBB4718A432 | 699-13767-0000-303 | 161-0546-10X

30. SN **1423525028587** | 4CBB4718A75F | 699-13767-0000-303 | 161-0546-10X

31. SN **1422525002418** | 3C6D66B28E72 | 699-13767-0000-303 | 161-0546-10X

32. SN **1422525002417** | 3C6D66B28E81 | 699-13767-0000-303 | 161-0546-10X

33. SN **1421725089604** | 3C6D666145AD | 699-13767-0000-301 | 161-0546-10X

34. SN **1422425065960** | 3C6D66B29618 | 699-13767-0000-303 | 161-0546-10X

35. SN **1422425080102** | 3C6D66B28FDD | 699-13767-0000-303 | 161-0546-10X

36. SN **1422525002495** | 3C6D66B28FD8 | 699-13767-0000-303 | 161-0546-10X

37. SN **1423525030807** | 4CBB47189E06 | 699-13767-0000-303 | 161-0546-10X

38. SN **1422525003950** | 3C6D66B27545 | 699-13767-0000-303 | 161-0546-10X

39. SN **1423525029458** | 4CBB4718A49E | 699-13767-0000-303 | 161-0546-10X

40. SN **1422525004176** | 3C6D66B2752A | 699-13767-0000-303 | 161-0546-10X

41. SN **1423525029101** | 4CBB4718A4D5 | 699-13767-0000-303 | 161-0546-10X

42. SN **1423525028836** | 4CBB4718A4D8 | 699-13767-0000-303 | 161-0546-10X

43. SN **1422525004173** | 3C6D66B27543 | 699-13767-0000-303 | 161-0546-10X

44. SN **1423525027795** | 4CBB4718A4AD | 699-13767-0000-303 | 161-0546-10X

45. SN **1422525002892** | 3C6D66B28DC7 | 699-13767-0000-303 | 161-0546-10X

46. SN **1423525027802** | 4CBB4718A42B | 699-13767-0000-303 | 161-0546-10X

47. SN **1423525029241** | 4CBB4718A4F4 | 699-13767-0000-303 | 161-0546-10X

48. SN **1423525029942** | 4CBB47189C49 | 699-13767-0000-303 | 161-0546-10X
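
To look for a pattern in a list like the one above, here is a minimal sketch that groups the entries by part number, by leading serial digits and by the first half of the 12-digit hex identifier. The row layout is taken from the list as posted. The meaning of the serial prefix and of the hex column is not known to us, so the prefix lengths used for grouping are an arbitrary assumption, and `tally` is a hypothetical helper name. Only a few rows are copied in here; the full list can be pasted in the same way.

```python
import re
from collections import Counter

# A few rows copied verbatim from the list above.
SAMPLE = """
 1. SN **1423225044968** | 3C6D66F30DF6 | 699-13767-0000-303 | 161-0546-10X
 5. SN **1422425080696** | 3C6D66B27068 | 699-13767-0000-303 | 161-0546-10X
26. SN **1421725089318** | 3C6D6661438F | 699-13767-0000-301 | 161-0546-10X
28. SN **1423525028588** | 4CBB4718A877 | 699-13767-0000-303 | 161-0546-10X
"""

# serial | 12-digit hex identifier | part number | second part number
ROW = re.compile(r"SN \*{0,2}(\d+)\*{0,2} \| ([0-9A-F]{12}) \| (\S+) \| (\S+)")


def tally(text, serial_prefix_len=5, hex_prefix_len=6):
    """Count rows per part number, per serial prefix and per hex-id prefix.

    The prefix lengths are guesses: we do not know how the serial number
    or the hex identifier is actually structured.
    """
    by_part = Counter()
    by_serial_prefix = Counter()
    by_hex_prefix = Counter()
    for match in ROW.finditer(text):
        serial, hex_id, part, _second_part = match.groups()
        by_part[part] += 1
        by_serial_prefix[serial[:serial_prefix_len]] += 1
        by_hex_prefix[hex_id[:hex_prefix_len]] += 1
    return by_part, by_serial_prefix, by_hex_prefix


if __name__ == "__main__":
    for counter in tally(SAMPLE):
        print(counter.most_common())
```

A grouping like this only shows how the faulty modules are distributed. To tell whether a given revision is over-represented, the same tally would have to be compared against known-good modules.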

Hi,
We have a customer reporting that PCIe C4 fails to detect the NVMe SSD during boot, which causes the system to hang. It affects certain modules at random and is not tied to particular serial numbers. We would not expect the issue to be present on so many modules. So if you put the modules on a developer kit and flash r35.6.2, do they fail to boot? Is there a failure rate, or do they fail to boot every time?

Hi,

We have also seen it to be connected with the PCIe/NVMe driver setup, but also that sometimes the memory bus triggers ECC errors before the MMU can handle them, and a hardware interrupt is then triggered, leading to an unhandled exception in EL3 within ATF.

We are trying to intercept those situations up front: we write to the scratch register for the boot slot so that the module reboots into the same slot, then force a reboot. Even if we are (hopefully) able to detect these cases early, the root cause remains a mystery. Obviously, we do not have access to all the internals, so you may have a better chance of tackling it.

Over the next few days/weeks we are going to stress test it to see whether this WAR improves things.

It boots most of the time, but depending on the module the failure can take anywhere from 60 to 6000 reboots to appear, with an average of 100-150. We have tested this using our own custom board and BSP, but also using Xavier and Nano devkits with the stock sample BSP as provided by Nvidia. So there is probably some timing issue near the limit that triggers this, or other factors, but on our side the module is mostly a black box.
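
To give an idea of how long a stress test has to run with failure rates like these, here is a minimal sketch. It assumes every reboot fails independently with the same probability, taken as one over the average number of reboots to failure. That is a simplification: the 60-6000 spread suggests the rate differs a lot from module to module. The function names are hypothetical.

```python
import math


def chance_of_failure_within(n_reboots, mean_reboots_to_failure):
    """Probability of at least one failure in n_reboots.

    Assumes independent reboots with per-boot failure
    probability p = 1 / mean_reboots_to_failure.
    """
    if mean_reboots_to_failure <= 1:
        raise ValueError("mean_reboots_to_failure must be greater than 1")
    p = 1.0 / mean_reboots_to_failure
    return 1.0 - (1.0 - p) ** n_reboots


def reboots_needed(mean_reboots_to_failure, confidence=0.95):
    """Smallest number of reboots after which an unfixed module would have
    failed at least once with the given confidence (same assumptions)."""
    if mean_reboots_to_failure <= 1:
        raise ValueError("mean_reboots_to_failure must be greater than 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    p = 1.0 / mean_reboots_to_failure
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    for mean in (100, 125, 150):
        print(mean, reboots_needed(mean))
```

Under these assumptions, a module averaging 125 reboots between failures needs a bit under 400 clean reboots before the absence of a crash says much about the WAR. A module at the 6000 end of the range would need far more.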

log.txt (9.4 KB)

Hi,

We tested another machine: the OS failed to boot, and it crashed twice in UEFI. Please help us check. Thank you.

Hi @smileandcry2023
The assertion looks different from the PCIe detection failure:

ASSERT [PrePi]  edk2-docker/nvidia-uefi/edk2-nvidia/Silicon/NVIDIA/PrePi/PrePi.c (507)

Do you observe it on a custom board or on the developer kit? Do you use r35.6.2 or r36.5? Does it occur on every boot, or is there a failure rate?

No, only one of our custom boards reported it.

Hi,

After contacting our Nvidia representative, we are dispatching a full repro, that is, a Nano devkit with an SSD and a faulty Orin NX, using the vanilla sample BSP 35.6.2 as provided by Nvidia.

The full repro also automates the reboot at the right time and logs events to highlight when the issue occurs. We also wrote a document explaining how to interpret the results and how to reflash the full repro, in case there is a need to validate other modules.
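
For readers who want to build something similar, here is a minimal sketch of the log analysis side only. It is not the tooling we shipped. It assumes each boot appends one record with a reboot counter and the active slot, and the `BootRecord` format and `find_slot_switches` name are hypothetical. How the slot is read on the target is left out (on Jetson Linux, I believe `nvbootctrl get-current-slot` is one way to get it).

```python
from dataclasses import dataclass


@dataclass
class BootRecord:
    boot_index: int  # reboot counter, increasing by one per boot
    slot: str        # active boot slot seen after boot, "A" or "B"


def find_slot_switches(records, expected_slot="A"):
    """Return the records where the module left the expected slot.

    Only the first boot away from the expected slot is reported for
    each excursion; consecutive boots on the other slot are skipped.
    """
    switches = []
    previous_slot = expected_slot
    for record in sorted(records, key=lambda r: r.boot_index):
        if record.slot != expected_slot and previous_slot == expected_slot:
            switches.append(record)
        previous_slot = record.slot
    return switches


if __name__ == "__main__":
    log = [
        BootRecord(1, "A"),
        BootRecord(2, "A"),
        BootRecord(3, "B"),  # unexpected switch
        BootRecord(4, "B"),
        BootRecord(5, "A"),
    ]
    for record in find_slot_switches(log):
        print(f"slot switch detected at boot {record.boot_index}")
```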

Hopefully, this will make it possible to find the root causes and countermeasures.

Hi

I have tested the patch file “overlay_mb1bct_35.x.tbz2”. Its description is:

“This overlay fixes a boot issue caused by the QSPI read timing not having sufficient margin to cover process, voltage, and temperature variations.”

With it, the rate of the issue decreased a lot.

The content of the patch is:

    / {
        device {
            qspiflash@0 {
                trimmer2-val = <0x04>;
            };
        };
    };

What does this mean?

Can we change the value?

Hi @smileandcry2023
The overlay fixes the issue:
Jetson Orin Nano boot failure with temperature dependency
Core board fails to boot

You may try 0x2 to see if stability improves further.

I changed the value to 0x02, but the issue “Orin Nano UEFI boot screen shows L4TLauncher: Attempting Direct Boot and cannot be turned off” happened more frequently.

Hi @smileandcry2023
Please share how to replicate the issue on the developer kit:
Orin Nano UEFI boot screen shows L4TLauncher: Attempting Direct Boot and cannot be turned off - #43 by DaneLLL

And let’s continue the discussion in that topic thread.

Hi Dane,

What about the full repro we sent to Nvidia Taiwan, which includes the Nano devkit, a problematic Orin NX module and detailed instructions to reproduce the issue?

What about the investigation into PCIe C4 failing to detect the NVMe SSD during boot and causing system hangs, as reported by another customer?

Can any results be shared?

Thanks