Dram alias check failure on D00 revision TX2 SOM and custom carrier board

Hi,

We have developed a custom carrier board for Jetson TX2 and have used it with multiple iterations of TX2 SOMs. On recent batches of SOMs we’ve ordered, we’ve noticed very specific and unusual behavior, and have reproduced it on multiple SOMs.

  • The devices initially program without issue on our carrier board.
  • After programming, the devices crash with a kernel oops when attempting to access the camera sensor.
  • Attempting to reprogram in our carrier board fails with the same message discussed in the thread “MTS error (2) : dram alias check failure on boot”:
[0001.811] C> MTS error (2) : dram alias check failure
[0001.816] C> cpu waypoint 0.5 failed
[0001.820] C> ERROR: Highest Layer Module = 0x32, Lowest Layer Module = 0x32,
Aux Info = 0x1, Reason = 0x6
  • Attempting to reprogram in a TX2 carrier board works without issue.
  • Taking the device out of the TX2 carrier board after reprogramming successfully and placing it in our carrier board results in the same CBoot error message, and the module refuses to boot:
[0001.811] C> MTS error (2) : dram alias check failure
[0001.816] C> cpu waypoint 0.5 failed
[0001.820] C> ERROR: Highest Layer Module = 0x32, Lowest Layer Module = 0x32,
Aux Info = 0x1, Reason = 0x6
  • The SOM boots fine when plugged into a TX2 carrier board.

So the problem appears to be related to our custom carrier board.

One thing we suspected was the auto power-on notice about delaying the assertion of CHARGER_PRSNT low, as discussed in “Power-on Autostart”. We don’t have this delay implemented in our hardware; we are simply shorting CHARGER_PRSNT low. However, the content of that post makes it seem like the issues would occur on shutdown, which would not explain the observations here. We’ve also tried shorting CHARGER_PRSNT low on the development board and can’t reproduce the boot issues that way.

We are wondering if you can share any detail about changes in the SOM in recent revisions which may explain this behavior, especially related to revision 699-83310-1000-D00 M, or where we should look on the carrier board design to explain the dram failure above.

Hi,

What is your release?

And what do you mean by the board initially working after being “programmed”? What is programmed here?

Hi @WayneWWW
Thanks for the response

What is programmed here?

Tegraflashed

What is your release?

The first tegraflash was actually L4T 28.2.1, given the way our production process is set up. All subsequent tegraflash reprogram attempts were L4T 32.4.3, including completed and failed attempts across our carrier board and the TX2 dev kit hardware.

I believe we’ve also attempted unsuccessfully to take a SOM which was successfully tegraflashed with L4T 32.4.3 on a TX2 development board and re-run tegraflash with our carrier board to the same L4T 32.4.3 release. I can re-verify this.

Hi,

May I get a clearer test result here? It looks like we have rel-28/rel-32, devkit/custom carrier board, and D00/non-D00 modules.

Could you show me a table that marks the test result (pass/fail) for every combination you’ve tested? I am not very sure about them just from reading the description so far.

Hi @WayneWWW
Here’s the table:
[image: table of test results]

Hi,

But the custom carrier board is working fine with older revisions of the SOM, right?

Correct. We have noticed B rev SOMs which work fine on the same carrier board in at least one instance.

Hello,

May I get your full log from uart?

@WayneWWW here’s a log from tegraflash:

[0000.631] I> Loading SCE-FW ...
[0000.634] W> No valid slot number is found in scratch register
[0000.640] W> Return default slot: _a
[0000.643] I> A/B: bin_type (12) slot 0
[0000.647] I> Loading partition sce-fw at 0xd7300000
[0000.652] I> Reading two headers - addr:0xd7300000 blocks:1
[0000.657] I> Addr: 0xd7300000, start-block: 5904752, num_blocks: 1
[0000.666] I> Binary(12) of size 76592 is loaded @ 0xd7300000
[0000.672] I> Init SCE
[0000.674] I> Copy BTCM section
[0000.677] W> No valid slot number is found in scratch register
[0000.683] W> Return default slot: _a
[0000.686] I> A/B: bin_type (13) slot 0
[0000.690] I> Loading partition cpu-bootloader at 0x96000000
[0000.695] I> Reading two headers - addr:0x96000000 blocks:1
[0000.701] I> Addr: 0x96000000, start-block: 5879856, num_blocks: 1
[0000.713] I> Binary(13) of size 282736 is loaded @ 0x96000000
[0000.719] W> No valid slot number is found in scratch register
[0000.725] W> Return default slot: _a
[0000.728] I> A/B: bin_type (20) slot 0
[0000.732] I> Loading partition bootloader-dtb at 0x8520f400
[0000.737] I> Reading two headers - addr:0x8520f400 blocks:1
[0000.743] I> Addr: 0x8520f400, start-block: 5881904, num_blocks: 1
[0101.875] E> Waypoint-0.5 ACK pending: 0x8
[0101.879] C> MTS error (2) : dram alias check failure
[0101.884] C> cpu waypoint 0.5 failed
[0101.887] C> ERROR: Highest Layer Module = 0x32, Lowest Layer Module = 0x32,
Aux Info = 0x1, Reason = 0x6
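
The timestamps in the log above jump from [0000.743] to [0101.875] right before the failure, so the bootloader appears to wait roughly 101 seconds before giving up. A minimal sketch for locating that kind of gap in a saved UART capture follows; the function name `largest_gap` is purely illustrative, and the only assumption about the log format is the `[ssss.mmm]` prefix seen in this thread.

```python
import re

# Matches the bootloader timestamp prefix, e.g. "[0101.875] E> ..."
TS_RE = re.compile(r"^\[(\d+\.\d+)\]")

def largest_gap(log_text):
    """Return (gap_seconds, line_before, line_after) for the biggest timestamp jump.

    Lines without a timestamp prefix (such as the wrapped "Aux Info" line)
    are ignored.
    """
    stamped = []
    for raw in log_text.splitlines():
        line = raw.strip()
        m = TS_RE.match(line)
        if m:
            stamped.append((float(m.group(1)), line))

    best = (0.0, None, None)
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        if t1 - t0 > best[0]:
            best = (t1 - t0, l0, l1)
    return best

sample = """\
[0000.737] I> Reading two headers - addr:0x8520f400 blocks:1
[0000.743] I> Addr: 0x8520f400, start-block: 5881904, num_blocks: 1
[0101.875] E> Waypoint-0.5 ACK pending: 0x8
[0101.879] C> MTS error (2) : dram alias check failure
"""
gap, before, after = largest_gap(sample)
print(f"{gap:.3f} s between:\n  {before}\n  {after}")
```

Run against a full capture, this only shows where the boot flow stalls; it does not say why.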

Here’s a log from boot:

[0001.811] C> MTS error (2) : dram alias check failure
[0001.816] C> cpu waypoint 0.5 failed
[0001.820] C> ERROR: Highest Layer Module = 0x32, Lowest Layer Module = 0x32,
Aux Info = 0x1, Reason = 0x6

Hello,

Sorry for one more request. Could you post the whole flash log from host side?

I don’t have the full host side log but I can get it tomorrow. Here’s where it hangs:

[   6.8878 ] Sending bootloader and pre-requisite binaries
[   6.8892 ] tegrarcm_v2 --download blob blob.bin
[   6.8903 ] Applet version 01.00.0000
[   6.9113 ] Sending blob
[   6.9113 ] [................................................] 100%
[   7.3700 ] 
[   7.3732 ] tegrarcm_v2 --boot recovery
[   7.3756 ] Applet version 01.00.0000
[   7.3801 ] 
[   8.3838 ] tegrarcm_v2 --isapplet
[   9.1352 ] 
[   9.1390 ] tegradevflash_v2 --iscpubl
[   9.1401 ] Cannot Open USB
[   9.6000 ] 
[  10.6039 ] tegrarcm_v2 --isapplet

I should mention we are using meta-tegra, so this is invoked from tegra186-flash-helper.sh. We are planning to compare the behavior with stock NVIDIA L4T programming tomorrow.

One other interesting observation from today: we are able to tegraflash from a non-booting R32 configuration back to R28 on our carrier board, so it appears to be something about the combination of the tegraflash step for R32.4.3 and our carrier board setup.

Hi,

Please just use pure flash.sh. Do not use something like meta-tegra. They are not our official tool so we cannot guarantee their functionality.

I’ve reproduced the same thing with the NVIDIA SDK Manager and release 32.4.3, so it’s not meta-tegra specific.

Serial logs:

E> Waypoint-0.5 ACK pending: 0x8
[0247.902] C> MTS error (2) : dram alias check failure
[0247.907] C> cpu waypoint 0.5 failed
[0247.911] C> ERROR: Highest Layer Module = 0x32, Lowest Layer Module = 0x32,
Aux Info = 0x1, Reason = 0x6

Host logs:

I’ve also noticed what looks like a similar issue here, which seems to describe what I’m seeing. Unfortunately, I wasn’t able to boot even after disconnecting the USB cable.

I see the same thing as described in “JetPack 4.4 will not flash on custom board” regarding JetPack 4.3: if I use the SDK Manager with JetPack 4.3 instead of 4.4, everything works. So the problem appears to be related to changes in the flash tools for JetPack 4.4, combined with some hardware difference between our board and the dev board.

Back to my earlier question:

We are wondering if you can share any detail about changes in the SOM in recent revisions which may explain this behavior, especially related to revision 699-83310-1000-D00 M, or where we should look on the carrier board design to explain the dram failure above.

Any suggestions for us as to where to look? We are just trying to narrow the pin scope down to something less than 400. My first suspicion was power-related pins; however, I’m not seeing anything that jumps out at me as a difference compared to the dev board, and I’ve removed our power supply from the equation by powering from a benchtop supply.

Hello danwalkes1,

Here is my summary so far. Please confirm if it is right.

  1. D00 module + devkit + jp4.4 -> Good
  2. D00 module + custom carrier + jp4.4 -> NG
  3. Non- D00 module + custom carrier + jp 4.4 -> Good
  4. D00 module + custom carrier + jp4.3 -> Good

Are you able to switch some binaries/scripts from rel-32.3.1 to rel-32.4.3 and see if it can make it work? For example, flash.sh or nvtboot_recovery.bin.

Yes, that’s right. For “Non- D00 module” the B rev SOM is the only one we’ve tried so far.
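
To make the confirmed summary easier to reason about, here is a small sketch that records the four combinations and lists every passing case that differs from the failing one in exactly one factor. The data comes straight from the summary above; the names `RESULTS`, `FACTORS` and `single_factor_flips` are only illustrative.

```python
# The four combinations confirmed in the summary above.
RESULTS = [
    {"module": "D00",     "carrier": "devkit", "jetpack": "4.4", "ok": True},
    {"module": "D00",     "carrier": "custom", "jetpack": "4.4", "ok": False},
    {"module": "non-D00", "carrier": "custom", "jetpack": "4.4", "ok": True},
    {"module": "D00",     "carrier": "custom", "jetpack": "4.3", "ok": True},
]
FACTORS = ("module", "carrier", "jetpack")

def single_factor_flips(results):
    """List (factor, failing_value, passing_value) for every pass/fail pair
    that differs in exactly one factor."""
    flips = []
    for bad in (r for r in results if not r["ok"]):
        for good in (r for r in results if r["ok"]):
            diff = [f for f in FACTORS if bad[f] != good[f]]
            if len(diff) == 1:
                flips.append((diff[0], bad[diff[0]], good[diff[0]]))
    return flips

for factor, failing, passing in single_factor_flips(RESULTS):
    print(f"changing {factor} from {failing} to {passing} makes it pass")
```

With these four rows, changing any one of the three factors makes the failure go away, which matches the observation that only the combination of D00 module, custom carrier and JetPack 4.4 fails.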

Are you able to switch some binaries/scripts from rel-32.3.1 to rel-32.4.3 and see if it can make it work? For example, flash.sh or nvtboot_recovery.bin.

Thanks for the suggestion. I’ve prepared a branch at https://github.com/BoulderAI/meta-tegra/tree/flashtools-32.3.1-hacks which I intend to test for this, which rolls back tegra186-flashtools-native to 32.3.1.

Are you able to switch some binaries/scripts from rel-32.3.1 to rel-32.4.3 and see if it can make it work? For example, flash.sh or nvtboot_recovery.bin.

No change when rolling both of these back to 32.3.1

Also no change when rolling any of the tegra186-flashtools-native back to 32.3.1

Since the error mentions dram I looked for dram references and noticed this difference between JP 4.4 and JP 4.3:

dan@yocto:/build/nvidia/nvidia_sdk/JetPack_4.4_Linux_JETSON_TX2/Linux_for_Tegra$ sudo find . -name "dram-*"
./bootloader/8755/dram-ecc_sigheader.bin.hash
./bootloader/8755/dram-ecc_sigheader.bin.encrypt
./bootloader/8755/dram-ecc.bin
./bootloader/8755/dram-ecc_sigheader.bin
./bootloader/dram-ecc.bin
dan@yocto:/build/nvidia/nvidia_sdk/JetPack_4.4_Linux_JETSON_TX2/Linux_for_Tegra$ sudo find ../../JetPack_4.3_Linux_JETSON_TX2/ -name "dram-*"
../../JetPack_4.3_Linux_JETSON_TX2/Linux_for_Tegra/bootloader/dram-ecc.bin

Copying dram-ecc.bin from JP 4.3 doesn’t help either.
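
For anyone who wants to repeat the comparison on other file patterns, the two `find` commands above can be approximated with a short script. This is only a sketch: `matches` and `only_in_first` are made-up helper names, and the script has to run with enough permissions to read both trees (the session above used `sudo`).

```python
from pathlib import Path

def matches(root, pattern="dram-*"):
    """Return the paths under root whose name matches pattern, relative to root."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob(pattern)}

def only_in_first(first_root, second_root, pattern="dram-*"):
    """Sorted relative paths that match in first_root but not in second_root."""
    return sorted(matches(first_root, pattern) - matches(second_root, pattern))

if __name__ == "__main__":
    # Directory names taken from the shell session above; adjust to your layout.
    jp44 = "JetPack_4.4_Linux_JETSON_TX2/Linux_for_Tegra"
    jp43 = "JetPack_4.3_Linux_JETSON_TX2/Linux_for_Tegra"
    for rel in only_in_first(jp44, jp43):
        print(rel)
```

Going by the `find` output above, this should list the four files under bootloader/8755/ for the default pattern.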

In all cases above I still get the same

[0070.255] E> Waypoint-0.5 ACK pending: 0x8
[0070.259] C> MTS error (2) : dram alias check failure
[0070.264] C> cpu waypoint 0.5 failed
[0070.267] C> ERROR: Highest Layer Module = 0x32, Lowest Layer Module = 0x32,
Aux Info = 0x1, Reason = 0x6

If I understand the cboot source correctly, Reason 0x6 corresponds to #define TEGRABL_ERR_TIMEOUT 0x06U (line 116 of bootloader/partner/common/include/tegrabl_error.h), and module 0x32 corresponds to #define TEGRABL_ERR_CPUINIT 0x32U. In other words, the error would be a timeout reported by the CPU init module.
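
As a rough illustration of that reading, the ERROR line can be pulled apart mechanically. The sketch below fills in only the two values quoted from tegrabl_error.h; any other code is reported as "unknown" rather than guessed, and the function name `decode_error` is hypothetical.

```python
import re

# Only the two defines cited above are included; this is not a full
# copy of tegrabl_error.h.
MODULES = {0x32: "TEGRABL_ERR_CPUINIT"}
REASONS = {0x06: "TEGRABL_ERR_TIMEOUT"}

FIELD_RE = re.compile(
    r"(Highest Layer Module|Lowest Layer Module|Aux Info|Reason)\s*=\s*(0x[0-9a-fA-F]+)"
)

def decode_error(log_text):
    """Extract the fields of a CBoot 'ERROR:' message and name the known codes."""
    fields = {name: int(value, 16) for name, value in FIELD_RE.findall(log_text)}
    return {
        "highest_module": MODULES.get(fields.get("Highest Layer Module"), "unknown"),
        "lowest_module": MODULES.get(fields.get("Lowest Layer Module"), "unknown"),
        "aux_info": fields.get("Aux Info"),
        "reason": REASONS.get(fields.get("Reason"), "unknown"),
    }

# The message wraps onto a second line in the UART log, so pass both lines.
error_lines = """\
[0070.267] C> ERROR: Highest Layer Module = 0x32, Lowest Layer Module = 0x32,
Aux Info = 0x1, Reason = 0x6
"""
print(decode_error(error_lines))
```

The meaning of Aux Info = 0x1 is not covered by the defines quoted here, so it is left as a raw number.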

Hi,

Just for some test

Is your board +D00 module able to be flashed on rel-28.2.1 and rel-28.4?

Have you made any changes to mem bct and bpmp dtb file?
A DRAM failure could point to a different DRAM part being used in D00 which you have not accounted for.
Random kernel oopses could also be due to unstable memory.
What is the voltage you are supplying, 19v?

Is your board +D00 module able to be flashed on rel-28.2.1 and rel-28.4?

We’ve only tried 28.2.1, 32.3.1 and 32.4.3. It works on 28.2.1 and 32.3.1, and fails on 32.4.3.

No, and in all of the SDK Manager cases I’m not making any changes to the stock L4T flashing binaries or sequence (other than what I’ve listed above as troubleshooting steps).

What is the voltage you are supplying, 19v?

We are supplying 12V on VDD_IN