High failure rate during reliability testing of 32GB Xavier AGX modules

We have been using the 16GB modules (part number 900-82888-0000-000) in our product for several years now, but we’re now moving to the 32GB module (900-82888-0040-000) due to the 16GB module going obsolete. We’re re-doing our reliability testing to verify that there are no issues with the migration.

In our thermal-cycling test (-40 °C to +85 °C), 6 of our 11 test units failed because the Xavier module did not boot back up after a power cycle. We removed the units from the test, verified that they were failing to boot, and attempted to re-flash the modules using our standard manufacturing procedure. 3 of the modules recovered and are now working fine, but 3 of them failed the flashing procedure with the following message:

...
[  10.4301 ] Erasing sdmmc_user: 3 ......... [Done]
[  10.9229 ] Writing partition master_boot_record with mbr_1_3.bin
[  10.9245 ] [................................................] 100%
[  10.9263 ] Writing partition primary_gpt with gpt_primary_1_3.bin
[  10.9319 ] [................................................] 100%
[  10.9335 ] 000000000d0d0001: o initialize partition table from GPT.
[  10.9462 ] 
[  10.9462 ] 
Error: Return value 1
Command tegradevflash_v2 --pt flash.xml.bin --create
Failed flashing t186ref.
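
For anyone comparing several host-side logs like the one above, here is a minimal Python sketch that pulls out the last partition write that was attempted and the reported return value. The function name and the returned dictionary keys are hypothetical, and the patterns are based only on the lines shown in this post, so other flash tool versions may format their output differently.

```python
import re


def summarize_flash_log(text):
    """Return the last partition write attempted and the reported error, if any."""
    summary = {
        "last_partition": None,
        "last_image": None,
        "return_value": None,
        "command": None,
    }
    for line in text.splitlines():
        # Drop the leading "[  10.9229 ] " host timestamp when present.
        body = re.sub(r"^\[\s*\d+\.\d+\s*\]\s*", "", line).strip()

        m = re.match(r"Writing partition (\S+) with (\S+)", body)
        if m:
            summary["last_partition"], summary["last_image"] = m.groups()
            continue

        m = re.match(r"Error: Return value (\d+)", body)
        if m:
            summary["return_value"] = int(m.group(1))
            continue

        # Only record the command line that follows an error report.
        m = re.match(r"Command (.+)", body)
        if m and summary["return_value"] is not None:
            summary["command"] = m.group(1)
    return summary
```

Run against the log above, this reports primary_gpt as the last partition written before the error and tegradevflash_v2 as the failing command.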

I suspect the eMMC is getting corrupted and/or damaged during our testing. Are there any tools that we can use to evaluate the health of the eMMC flash?
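
One possible starting point, offered as a sketch rather than a confirmed answer: eMMC 5.0 and later devices report two wear estimates (DEVICE_LIFE_TIME_EST_TYP_A/B) and a PRE_EOL_INFO field in EXT_CSD. Mainline Linux kernels expose these as the sysfs files life_time and pre_eol_info under the mmc device directory, and the mmc-utils command "mmc extcsd read /dev/mmcblk0" prints the same fields. Whether the kernel shipped with a given JetPack release exposes the sysfs files is an assumption to verify on the unit, and the path /sys/block/mmcblk0/device below is likewise assumed. These values only describe wear and reserved-block consumption, so they may not reveal other kinds of damage, and they can only be read on a unit that still boots.

```python
from pathlib import Path

# PRE_EOL_INFO values as defined for eMMC 5.0+ EXT_CSD.
PRE_EOL = {
    0x00: "not defined",
    0x01: "normal",
    0x02: "warning: about 80% of reserved blocks consumed",
    0x03: "urgent",
}


def decode_life_time(value):
    """Decode a DEVICE_LIFE_TIME_EST_TYP_A/B value into a 10% band."""
    if value == 0x00:
        return "not defined"
    if 0x01 <= value <= 0x0A:
        return f"{(value - 1) * 10}-{value * 10}% of estimated life used"
    if value == 0x0B:
        return "exceeded maximum estimated life"
    return "reserved"


def decode_pre_eol(value):
    """Decode a PRE_EOL_INFO value."""
    return PRE_EOL.get(value, "reserved")


def parse_health(life_time_text, pre_eol_text):
    """Parse sysfs text such as '0x01 0x02' and '0x01'."""
    type_a, type_b = (int(tok, 16) for tok in life_time_text.split())
    return {
        "life_time_type_a": decode_life_time(type_a),
        "life_time_type_b": decode_life_time(type_b),
        "pre_eol": decode_pre_eol(int(pre_eol_text.strip(), 16)),
    }


def read_health(device_dir="/sys/block/mmcblk0/device"):
    """Read both sysfs files; the default path is an assumption."""
    base = Path(device_dir)
    return parse_health(
        (base / "life_time").read_text(),
        (base / "pre_eol_info").read_text(),
    )
```

Calling read_health() on each of the recovered units and comparing the results against a module that never went through thermal cycling would show whether the wear indicators differ at all.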

Hi,

Are you seeing this issue on the devkit or on your custom board?

This was tested on our full production device (using a custom board).

Hi,

Is it possible to evaluate this on a devkit? We have not received reports from other customers of nearly half of their modules failing to flash.

Hi Wayne,

It would be hard for us to change our reliability test setup to use the devkit, and it also would no longer be a representative test of our product. Can Nvidia share the results of their devkit thermal-cycle testing? We can compare the test profiles and see if there are any obvious differences.

Are there any software tools that we can use to check the health of the eMMC on the 3 working units? I know there are tools like “smartctl” for NVME drives, but I’m not familiar with tools for eMMC. We can also RMA the 3 failed units for investigation by Nvidia if you have failure analysis capabilities.

Hi,

I don’t suggest an RMA at this moment. It would take a long time, and I am not sure whether I would actually receive your device at all.

Also, please do not think I am asking you to run your full test on the devkit. You told us there is a flash problem, so my actual question is whether you can flash that module on a devkit.

This is not related to any kind of thermal or eMMC check. I just want to know whether you can flash the module on a devkit and whether that behavior is repeatable.

For example, move the module from your custom board to the devkit, flash it, and see whether the flash succeeds.
Then we will try the next debug step according to the result.

What’s the next debug step? Let’s skip straight to that step please.

If you mean that you want to skip the test on the devkit, then no. That one cannot be skipped.

If your device cannot flash even on devkit, then provide me the serial number and the exact BSP you are using to flash.

If your device can flash on devkit, then we need to check the serial console from UART when you failed to flash on your custom board.

Ok, I will try to find one of our devkits next week and test this out. In general, we don’t use devkits anymore because we have built our own custom product. Do most of your customers use the devkit?

Is there anything we can do to evaluate the other 3 modules that are working? Is there a tool to check the health of the eMMC flash chip?

If you are testing a module and it works on your carrier board, that result should be definitive. However, if it fails on your carrier board, there are a lot of reasons to test on a dev kit instead. If the failure occurs on both the dev kit and your carrier board for all of those modules, then it probably isn’t an issue with the third-party carrier board. However, if it works on the dev kit, then some tuning would be needed on the third-party carrier board (e.g., something like drive strength might need a change for no reason other than longer/shorter traces).

Thanks, yes I will give the devkit a try on Tuesday. We have never had an issue like this with our custom boards, though, so I suspect that it’s a problem with the Xavier module.

Do you know why the eMMC was removed from the Orin Nano and Orin NX products? Was this due to reliability issues with the eMMC chip?

Basically, if you want to report any issue on this forum, try to reproduce it on a devkit first.

We are not able to debug things on your custom carrier board.

Some dev kit products were just designed with an SD card instead of eMMC to make a lower-cost product available. I doubt reliability has anything to do with it. Modules themselves seem to be very reliable. SD cards fail far more often than eMMC, and in the case of hardware failure, carrier boards are far more often at fault than modules. SD cards are just for people wanting to spend less.

I put one of our failing Xavier modules onto the devkit and tried flashing there. I get the same failure message. I captured the output of the debug UART, see below.

.359] I> MB1 (prd-version: 1.5.1.2-t194-41334769-9ec1833d)
[0002.365] I> Boot-mode: Platform RCM
[0002.368] I> Chip revision : A02P
[0002.371] I> Bootrom patch version : 15 (correctly patched)
[0002.376] I> ATE fuse revision : 0x200
[0002.380] I> Ram repair fuse : 0x0
[0002.383] I> Ram Code : 0x2
[0002.386] I> rst_source : 0x0
[0002.388] I> rst_level : 0x0
[0002.392] I> USB configuration success
[0004.558] I> mb2 image downloaded
[0004.644] I> Recovery boot mode 0
[0004.648] I> Boot-device: eMMC
[0004.653] I> UPHY full init done
[0004.657] I> MB1 done

[0004.661] W> Profiler not initialized
[0004.665] I> Welcome to MB2(TBoot-BPMP) Applet (version: 00.00.2018.32-mobile-2cd5f333)
[0004.673] W> Profiler not initialized
[0004.676] I> DMA Heap @ [0x40020000 - 0x40065800]
[0004.681] I> Default Heap @ [0xd486400 - 0xd48a400]
[0004.686] W> Profiler not initialized
[0004.689] W> Profiler not initialized
[0004.693] E> DEVICE_PROD: Invalid value data = 0, size = 0.
[0004.698] W> device prod register failed
[0004.702] W> Profiler not initialized
[0004.729] I> sdmmc DDR50 mode
[0004.733] I> No supported QSPI flash found
[0004.737] E> QSPI Flash: Insufficient flash size (0 MB)
[0004.742] I> QSPI Flash is not present.
[0004.793] E> Link startup dme_set failed
[0004.797] E> UFS initialization failed
[0004.800] I> UFS is not present
[0004.803] W> Profiler not initialized
[0004.810] I> Found 17 partitions in SDMMC_BOOT (instance 3)
[0004.817] W> Cannot find any partition table for 00010003
[0004.822] W> Profiler not initialized
[0004.826] W> Profiler not initialized
[0004.829] W> Profiler not initialized
[0004.833] I> Entering 3p server
[0004.836] I> USB configuration success
[0005.880] I> Populate eeprom info
[0005.884] I> Populate eeprom info for module cvm
[0006.099] I> Rebooting : reboot-recovery


[000> MB1 (prd-version: 1.5.1.2-t194-41334769-9ec1833d)
[0067.277] I> Boot-mode: RCM
[0067.280] I> Chip revision : A02P
[0067.283] I> Bootrom patch version : 15 (correctly patched)
[0067.288] I> ATE fuse revision : 0x200
[0067.291] I> Ram repair fuse : 0x0
[0067.294] I> Ram Code : 0x2
[0067.297] I> rst_source : 0xb
[0067.300] I> rst_level : 0x1
[0067.303] I> USB configuration success
[0069.434] I> bct_bootrom image downloaded
[0069.443] W> MB1_PLATFORM_CONFIG: device prod data is empty in MB1 BCT.
[0069.451] I> Temperature = 34000
[0069.454] W> Skipping boost for clk: BPMP_CPU_NIC
[0069.458] W> Skipping boost for clk: BPMP_APB
[0069.462] W> Skipping boost for clk: AXI_CBB
[0069.466] W> Skipping boost for clk: AON_CPU_NIC
[0069.471] W> Skipping boost for clk: CAN1
[0069.474] W> Skipping boost for clk: CAN2
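
Since the capture above is dominated by informational lines, a small filter can make the warnings and errors easier to compare between modules. This is a minimal Python sketch that assumes every complete line follows the "[seconds] LEVEL> message" layout seen in this capture. Treating "Profiler not initialized" as noise is my own assumption, and the names LINE_RE, NOISE, and notable_lines are hypothetical.

```python
import re

# Matches lines such as "[0004.793] E> Link startup dme_set failed".
LINE_RE = re.compile(r"^\[(\d+\.\d+)\]\s+([IWE])>\s+(.*)$")

# Messages assumed to be benign repetition; adjust as needed.
NOISE = ("Profiler not initialized",)


def notable_lines(text, levels=("E", "W")):
    """Return (timestamp, level, message) for non-noise warnings and errors."""
    found = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # blank or truncated line, e.g. a clipped capture start
        timestamp, level, message = m.groups()
        if level in levels and not any(n in message for n in NOISE):
            found.append((float(timestamp), level, message))
    return found
```

Running it on a capture from a failing module and on one from a working module, then comparing the two lists, should show which E> and W> lines are specific to the failure and which appear on every boot.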

Hi,

Please provide the following information along with your UART log:

  1. Which jetpack release are you trying to use now?

  2. What flash command did you run on your host side?

I used Jetpack 4.3 (this is the version of jetpack that we have been using for the last 4 years on all production devices).

I used the following flash command:
./flash.sh jetson-xavier mmcblk0p1

Hi,

Please try to flash these devices with jetpack4.6.1 first. We will tell you the next step after your test result.

Ok, I tried flashing jetpack4.6.1 and I received the same result.

Could you also share the UART log and the host-side log from the flash failure with jp4.6.1?

Sorry. My mistake. Could you try to flash jp4.6.2 (32.7.2) instead of jp4.6.1?

The reason to test jp4.6.2 (or even jp5.0.2) is because of the PCN update here.