We have developed a custom carrier board for Jetson TX2 and have used it with multiple iterations of TX2 SOMs. On recent batches of SOMs we’ve ordered, we’ve noticed very specific and unusual behavior, and have reproduced it on multiple SOMs.
The devices initially program without issue on our carrier board.
After programming, the devices crash with a kernel oops when attempting to access the camera sensor.
[0001.811] C> MTS error (2) : dram alias check failure
[0001.816] C> cpu waypoint 0.5 failed
[0001.820] C> ERROR: Highest Layer Module = 0x32, Lowest Layer Module = 0x32,
Aux Info = 0x1, Reason = 0x6
Attempting to reprogram in a TX2 carrier board works without issue.
Taking the device out of the TX2 carrier board after a successful reprogram and placing it in our mainboard results in the same CBoot error message, and the part refuses to boot:
[0001.811] C> MTS error (2) : dram alias check failure
[0001.816] C> cpu waypoint 0.5 failed
[0001.820] C> ERROR: Highest Layer Module = 0x32, Lowest Layer Module = 0x32,
Aux Info = 0x1, Reason = 0x6
The SOM boots fine when plugged into a TX2 dev kit carrier board.
So the problem appears to be related to our custom carrier board.
One thing we suspected was the auto-power-on note about delaying the assertion of CHARGER_PRSNT low, as discussed in "Power-on Autostart - #4 by Trumany". We don’t have this delay implemented in our hardware; we are simply shorting CHARGER_PRSNT low. However, the content of that post makes it seem like the issues would occur on shutdown, which would not explain the observations here. We’ve also tried shorting CHARGER_PRSNT low on the development board, and we can’t reproduce the boot issues in that case.
We are wondering if you can share any detail about changes in the SOM in recent revisions which may explain this behavior, especially related to revision 699-83310-1000-D00 M, or where we should look on the carrier board design to explain the dram failure above.
The first tegraflash was actually L4T 28.2.1, given the way our production process is set up. All subsequent tegraflash reprogram attempts were L4T 32.4.3, including completed and failed attempts across our carrier board and the TX2 dev kit hardware.
I believe we’ve also attempted unsuccessfully to take a SOM which was successfully tegraflashed with L4T 32.4.3 on a TX2 development board and re-run tegraflash with our carrier board to the same L4T 32.4.3 release. I can re-verify this.
May I get a clearer test result here? It looks like we have rel-28/rel-32, devkit/custom carrier board, and D00/non-D00 modules.
Could you show me a table that marks the test result (pass/fail) for every combination you’ve tested? I am not very sure about them just from reading the description so far.
I should mention we are using meta-tegra, so this is invoked from tegra186-flash-helper.sh. We are planning to compare the behavior with stock NVIDIA L4T programming tomorrow.
One other interesting observation from today: we are able to tegraflash from a non-booting R32 configuration back to R28 on our carrier board, so it appears to be something about the combination of the tegraflash step for R32.4.3 and our carrier board setup.
I’ve reproduced the same thing with the NVIDIA SDK Manager and release 32.4.3, so it’s not meta-tegra specific.
Serial logs:
E> Waypoint-0.5 ACK pending: 0x8
[0247.902] C> MTS error (2) : dram alias check failure
[0247.907] C> cpu waypoint 0.5 failed
[0247.911] C> ERROR: Highest Layer Module = 0x32, Lowest Layer Module = 0x32,
Aux Info = 0x1, Reason = 0x6
Host logs:
I’ve also noticed what looks like a similar issue here, which seems to describe what I’m seeing. Unfortunately, I wasn’t able to boot even after disconnecting the USB cable.
I see the same behavior with JetPack 4.3 as reported in "JetPack 4.4 will not flash on custom board - #24 by chadiris": if I use the SDK Manager with JetPack 4.3 instead of 4.4, everything works. So the problem appears to be related to changes in the flash update tools for JetPack 4.4, combined with some difference between our hardware and the dev board hardware.
Back to my earlier question:
We are wondering if you can share any detail about changes in the SOM in recent revisions which may explain this behavior, especially related to revision 699-83310-1000-D00 M, or where we should look on the carrier board design to explain the dram failure above.
Any suggestions for us as to where to look? Just trying to narrow down the pin scope to something less than 400. My first suspicion was power-related pins; however, I’m not seeing anything that jumps out at me as a difference compared to the dev board, and I’ve removed our power supply from the equation by powering from a benchtop supply.
Here is my summary so far. Please confirm whether it is right.
D00 module + devkit + JP 4.4 → Good
D00 module + custom carrier + JP 4.4 → NG
Non-D00 module + custom carrier + JP 4.4 → Good
D00 module + custom carrier + JP 4.3 → Good
Are you able to switch some binaries/scripts from rel-32.3.1 to rel-32.4.3 and see if it can make it work? For example, flash.sh or nvtboot_recovery.bin.
Yes, that’s right. For “Non- D00 module” the B rev SOM is the only one we’ve tried so far.
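For what it's worth, the confirmed matrix can be checked mechanically. The sketch below is only an illustration (the field names and the `single_factor_flips` helper are made up for this post); it lists every passing run that differs from the failing run in exactly one factor.

```python
# Results as confirmed in this thread. "non-D00" is the B rev SOM,
# the only non-D00 module tried so far.
RESULTS = [
    {"module": "D00", "carrier": "devkit", "jetpack": "4.4", "boots": True},
    {"module": "D00", "carrier": "custom", "jetpack": "4.4", "boots": False},
    {"module": "non-D00", "carrier": "custom", "jetpack": "4.4", "boots": True},
    {"module": "D00", "carrier": "custom", "jetpack": "4.3", "boots": True},
]
FACTORS = ("module", "carrier", "jetpack")


def single_factor_flips(results):
    """Return (factor, failing value, passing value) for each passing run
    that differs from a failing run in exactly one factor."""
    flips = []
    for bad in (r for r in results if not r["boots"]):
        for good in (r for r in results if r["boots"]):
            diff = [f for f in FACTORS if bad[f] != good[f]]
            if len(diff) == 1:
                flips.append((diff[0], bad[diff[0]], good[diff[0]]))
    return flips


for factor, failing, passing in single_factor_flips(RESULTS):
    print(f"{factor}: {failing} fails, {passing} boots")
```

Changing any one of the three factors makes the failure go away, so on the data so far the failure needs the D00 module, the custom carrier and JP 4.4 all together.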
Are you able to switch some binaries/scripts from rel-32.3.1 to rel-32.4.3 and see if it can make it work? For example, flash.sh or nvtboot_recovery.bin.
No change when rolling both of these back to 32.3.1
Also no change when rolling any of the tegra186-flashtools-native back to 32.3.1
Since the error mentions dram I looked for dram references and noticed this difference between JP 4.4 and JP 4.3:
Copying dram-ecc.bin from JP 4.3 doesn’t help either.
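In case it helps anyone repeat the comparison, here is a rough sketch of how two unpacked L4T trees could be compared for DRAM-related files. The `compare_trees` helper and the `*dram*` name pattern are my own assumptions, not part of the NVIDIA tooling, and the directory paths you pass in depend on where each release was unpacked.

```python
import hashlib
from pathlib import Path


def digest(path):
    """SHA-256 of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def compare_trees(old_root, new_root, pattern="*dram*"):
    """Compare files whose names match `pattern` under two directory trees.

    Returns a dict mapping each relative path to one of
    "only in old", "only in new" or "differs". Identical files are omitted.
    """
    old_root, new_root = Path(old_root), Path(new_root)
    old = {p.relative_to(old_root): p for p in old_root.rglob(pattern) if p.is_file()}
    new = {p.relative_to(new_root): p for p in new_root.rglob(pattern) if p.is_file()}
    report = {}
    for rel in sorted(set(old) | set(new)):
        if rel not in new:
            report[rel.as_posix()] = "only in old"
        elif rel not in old:
            report[rel.as_posix()] = "only in new"
        elif digest(old[rel]) != digest(new[rel]):
            report[rel.as_posix()] = "differs"
    return report
```

Calling `compare_trees(jp43_dir, jp44_dir)` with the two unpacked release directories (hypothetical variable names) gives a short list of candidates to swap one at a time.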
In all cases above I still get the same error:
[0070.255] E> Waypoint-0.5 ACK pending: 0x8
[0070.259] C> MTS error (2) : dram alias check failure
[0070.264] C> cpu waypoint 0.5 failed
[0070.267] C> ERROR: Highest Layer Module = 0x32, Lowest Layer Module = 0x32,
Aux Info = 0x1, Reason = 0x6
If I understand the cboot source correctly, Reason 0x6 corresponds to #define TEGRABL_ERR_TIMEOUT 0x06U (line 116 of bootloader/partner/common/include/tegrabl_error.h) and module 0x32 refers to #define TEGRABL_ERR_CPUINIT 0x32U. If so, the error appears to be a timeout reported by the CPU init module.
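To make that reading of the log repeatable, here is a minimal sketch that pulls the fields out of the CBoot "ERROR:" report and names them. The `decode_cboot_error` helper is hypothetical, and the lookup tables contain only the two values quoted above from tegrabl_error.h; anything else is reported as unknown rather than guessed.

```python
import re

# Only the two definitions cited above from tegrabl_error.h are listed here.
MODULES = {0x32: "TEGRABL_ERR_CPUINIT"}
REASONS = {0x06: "TEGRABL_ERR_TIMEOUT"}

# Matches "<field name> = 0x<hex>" pairs, which may span two log lines.
FIELD = re.compile(
    r"(Highest Layer Module|Lowest Layer Module|Aux Info|Reason)\s*=\s*(0x[0-9a-fA-F]+)"
)


def _name(table, code):
    """Look up a code, keeping unknown values visible instead of guessing."""
    if code is None:
        return None
    return table.get(code, "unknown (0x%x)" % code)


def decode_cboot_error(log_text):
    """Decode the fields of a CBoot 'ERROR:' report found in log_text."""
    fields = {key: int(value, 16) for key, value in FIELD.findall(log_text)}
    return {
        "highest_module": _name(MODULES, fields.get("Highest Layer Module")),
        "lowest_module": _name(MODULES, fields.get("Lowest Layer Module")),
        "aux_info": fields.get("Aux Info"),
        "reason": _name(REASONS, fields.get("Reason")),
    }


SAMPLE = (
    "[0070.267] C> ERROR: Highest Layer Module = 0x32, Lowest Layer Module = 0x32,\n"
    "Aux Info = 0x1, Reason = 0x6"
)
print(decode_cboot_error(SAMPLE))
```

I have not looked up what Aux Info = 0x1 means, so it is left as a raw number.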
Have you made any changes to the mem BCT and BPMP DTB files?
A DRAM failure could point to a different DRAM part used in D00 which you have not accounted for.
Random kernel oopses could also be due to unstable memory.
What voltage are you supplying, 19 V?
Can your board + D00 module be flashed with rel-28.2.1 and rel-28.4?
We’ve only tried 28.2.1, 32.3.1 and 32.4.3. It works on 28.2.1 and 32.3.1, and fails on 32.4.3.
No, and in all of the SDK Manager cases I’m not making any changes to the stock L4T flashing binaries or sequence (other than what I’ve listed above as troubleshooting steps).