Yes, I changed the files in Linux_for_Tegra/bootloader, but only the two files below.
tegra234-bpmp-3767-0000-a02-3509-a02.dtb
tegra234-bpmp-3767-0000-a02-3509-a02-maxn.dtb
Did you find the other two files?
In my use case, I am using: tegra234-bpmp-3767-0000-a02-3509-a02.dts
I do have the other files, and I am using 35.6.2 if this helps. We are currently testing this option together with other changes we made as a SW WAR for this HW issue. Initial tests appear to improve the situation, but on some modules there are still glitches: 44 in ~2700 reboots.
We plan to extend the tests over a longer period with more boards to gain more confidence.
Once you try it, I would be interested if you could share your findings. We are in the same boat on this one.
Thanks
Thank you for sharing. We have just started testing and will upload the test results as soon as we have them.
Hi,
Modifying only tegra234-bpmp-3767-0000-a02-3509-a02.dtb should be good enough. The config file tegra234-bpmp-3767-0000-a02-3509-a02-maxn.dtb is applied when flashing MAXN config. The other two super dtb files are for super modes, which are enabled in later releases.
You can check Bit 12 to confirm it is disabled:
$ sudo busybox devmem 0x20041000
0x20011C25 // b0010 0000 0000 0001 0001 1100 0010 0101
// this is Bit 12
b0010 0000 0000 0001 000[1] 1100 0010 0101
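As a sanity check on the bit arithmetic, here is a minimal Python sketch that expands the value printed by devmem into nibbles and extracts a single bit. The helper names `bit_is_set` and `to_nibbles` are hypothetical, and the register value is simply the one quoted above; what Bit 12 controls is taken from this post and is not verified by the code.

```python
def bit_is_set(value: int, bit: int) -> bool:
    """Return True if the given bit (0 = least significant) is 1 in value."""
    return (value >> bit) & 1 == 1


def to_nibbles(value: int, width: int = 32) -> str:
    """Format value as binary, grouped in 4-bit nibbles, most significant first."""
    bits = format(value, f"0{width}b")
    return " ".join(bits[i:i + 4] for i in range(0, width, 4))


# Value printed by `busybox devmem 0x20041000` in the post above.
reg = int("0x20011C25", 16)
print(to_nibbles(reg))
print("bit 12 set:", bit_is_set(reg, 12))
```

Pasting the devmem output into `reg` avoids counting bits by hand when comparing several modules.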
An update on our tests with the current SW WAR. Two devkits ran for several days, with ~9k reboots. One of them had 2 glitches that would have frozen the device without our own countermeasure. The other had 131 glitches, but both are still up and running.
We also ran the test on production units that include other parts. We ran it over the weekend, and out of the 12 units, 11 survived ~5.5k reboots. One froze, but because this was a production unit, we do not know whether this was caused by this issue or by something else, such as a hang during the reboot phase.
We are going to continue stress testing, with a particular focus on that frozen unit.
Hope it helps!
Hi,
In our tests, with spread spectrum disabled on PCIe C4, the Orin NX module passed 7440 reboots and another Orin Nano module passed 9160 reboots. Both were mounted on an Orin Nano developer board and flashed with Jetpack 6.2.2 r36.5 for the tests. The failure to detect NVMe SSD storage should be due to spread spectrum being enabled. We propose disabling spread spectrum on PCIe C4 as a software solution.
If you have disabled it and still see the issue with specific modules, please mount the module on an Orin Nano developer board, flash Jetpack 6.2.2 with spread spectrum disabled, and test whether the issue still appears. If the issue is still present on the developer kit, please share the UART log for reference.
Sorry to insist again, but we really don't care at all about the Orin Nano (to stay polite), nor about jetpack 6.
We can consider jp6 once you have added support for the xavier generation on jp6, which won't happen, as you already didn't manage to fully support the xaviers on jp5!
Currently we have several thousand xaviers deployed worldwide, and for compatibility reasons we need to run jp5 over the whole mixed xavier/orin fleet for now.
So the fact stays the same: we have received from YOU hundreds of DEFECTIVE Orin NX modules, amounting to a value of several HUNDREDS OF THOUSANDS OF DOLLARS.
So we expect from you a proper root cause analysis and a proper fix, not a workaround, or a replacement of all defective devices. And stop shipping new defective modules at a rate of 20% of delivered modules.
Sorry to try to be clear and to have to restate what should be obvious by now.
Hi,
Just curious whether you were able to perform some stress tests, and what the outcome was on your end. For us, a few devices got stuck during shutdown/reboot because some NVIDIA service hung the system, so this is unrelated to the boot swap / hang in EL3. We are going to ditch this buggy and poorly written service, as it is of no use in our context.
One of the devkits rebooting in UEFI survived 40k reboots and had 450 glitches where our WAR managed to avoid a crash. So even though the suggestion seems to reduce the number of occurrences, it is not a magic bullet. In fact, we tried not applying this suggestion and keeping only our WAR, and the devkit was fine. And using only the suggested change triggered a crash in EL3.
In the end, we are most likely going to apply the suggestion (the fewer glitches the better), but can someone from Nvidia comment on Martin's question regarding the potential impact on EMC compliance? That is the very least you can do!
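To compare runs of different lengths, the counts quoted in this thread can be turned into a per-reboot rate. This is only a rough sketch: the function name `glitch_rate` and the run labels are hypothetical, and the reboot counts are the approximate figures from the posts, so the percentages are indicative at best.

```python
def glitch_rate(glitches: int, reboots: int) -> float:
    """Glitches per reboot, expressed as a percentage."""
    if reboots <= 0:
        raise ValueError("reboots must be positive")
    return 100.0 * glitches / reboots


# Figures quoted in this thread; reboot counts are approximate.
runs = {
    "initial test on modules (44 in ~2700)": (44, 2700),
    "devkit rebooting in UEFI (450 in ~40k)": (450, 40000),
}

for name, (glitches, reboots) in runs.items():
    print(f"{name}: {glitch_rate(glitches, reboots):.2f}% of reboots")
```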
Hi,
We have the EMC certification for developer kit:
Jetson Download Center | NVIDIA Developer
For a custom board, you would need to obtain a new EMC certification for the production system. Please disable spread spectrum on PCIe C4 and qualify EMC compliance.
Hi Dane,
Can you provide some information about the changes observed in the EMC certification results for the official nano devkit between enabling and disabling spread spectrum? We need a better understanding of the implications for our own device, as qualifying for EMC compliance is a long and costly venture.
Also, as disabling spread spectrum is considered a temporary measure based on feedback from our Nvidia representative, Nvidia should provide deep technical and financial support and/or replace all the modules with this defect, with an associated fee due to the cost overhead involved.
Moreover, what is the current status of your investigations? Have you made any progress regarding the root cause of this phenomenon? Any progress regarding the behaviour of the PCIe interface?
Thanks for providing a thoughtful response in a timely manner.
Hi Dane,
We also noticed this issue on our custom Orin NX setup, using Jetpack 5.
To go further and to test whether UEFI was the culprit, we modified the flash partition layout to add a mini-linux which would take care of the PCIe/NVMe init/deinit. To achieve this, we got rid of the A/B partitioning system to fit our linux, which was compiled from the sources shipped with L4T, with some minimal config tuning. We also fully disabled the PCIe init in UEFI, and modified it to directly launch the linux image once the boot phase is ready. The linux then runs a minimal initrd which checks for the presence of /dev/nvme0n1 and, if found, tries to read some bytes from it. If either of those two tests fails, we log the failure. Once the tests are done, the system reboots and starts again.
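The two checks described above (device node present, then a small read) could look roughly like the following. This is a hedged Python sketch, not the initrd script actually used in our setup: the function name `check_nvme` and the 4096-byte read size are assumptions, and the logging and reboot steps are left out.

```python
import os


def check_nvme(path: str = "/dev/nvme0n1", nbytes: int = 4096):
    """Return (ok, reason): the device node exists and nbytes can be read from it."""
    if not os.path.exists(path):
        return False, f"{path} not found"
    try:
        with open(path, "rb") as dev:
            data = dev.read(nbytes)
    except OSError as exc:
        # Covers I/O errors as well as missing permissions.
        return False, f"read failed: {exc}"
    if len(data) < nbytes:
        return False, f"short read: {len(data)} of {nbytes} bytes"
    return True, "ok"


if __name__ == "__main__":
    ok, reason = check_nvme()
    print("PASS" if ok else f"FAIL: {reason}")
```

Returning a reason string alongside the pass/fail flag makes it possible to tell a missing device apart from a device that enumerates but cannot be read.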
The results were the following:
So to summarize, letting linux init the PCIe/NVMe did not fix the issue, but it seems linux does a better job when uninitializing them. The linux init won't be an option, as we cannot reduce its size to fit in the A/B partitioning, and we don't want to change that for now.