High failure rate during reliability testing of 32GB Xavier AGX modules

We have been using the 16GB modules (part number 900-82888-0000-000) in our product for several years now, but we’re now moving to the 32GB module (900-82888-0040-000) due to the 16GB module going obsolete. We’re re-doing our reliability testing to verify that there are no issues with the migration.

In our thermal-cycling test (-40 °C to +85 °C), 6 of our 11 test units failed because the Xavier module would not boot back up after a power cycle. We removed the units from the test, verified that they were failing to boot, and attempted to re-flash the modules using our standard manufacturing procedure. 3 of the modules recovered and are now working fine, but 3 of them failed the flashing procedure with the following message:

...
[  10.4301 ] Erasing sdmmc_user: 3 ......... [Done]
[  10.9229 ] Writing partition master_boot_record with mbr_1_3.bin
[  10.9245 ] [................................................] 100%
[  10.9263 ] Writing partition primary_gpt with gpt_primary_1_3.bin
[  10.9319 ] [................................................] 100%
[  10.9335 ] 000000000d0d0001: o initialize partition table from GPT.
[  10.9462 ] 
[  10.9462 ] 
Error: Return value 1
Command tegradevflash_v2 --pt flash.xml.bin --create
Failed flashing t186ref.

I suspect the eMMC is getting corrupted and/or damaged during our testing. Are there any tools that we can use to evaluate the health of the eMMC flash?

Hi,

Are you seeing this issue on the devkit or on your custom board?

This was tested on our full production device (using a custom board).

Hi,

Is it possible to evaluate this using the devkit? We have not received reports from other customers of nearly half of their modules having flash problems.

Hi Wayne,

It would be hard for us to change our reliability test setup to use the devkit, and it also would no longer be a representative test of our product. Can Nvidia share the results of their devkit thermal-cycle testing? We can compare the test profiles and see if there are any obvious differences.

Are there any software tools that we can use to check the health of the eMMC on the 3 working units? I know there are tools like “smartctl” for NVMe drives, but I’m not familiar with tools for eMMC. We can also RMA the 3 failed units for investigation by Nvidia if you have failure analysis capabilities.

Hi,

I don’t suggest an RMA at this moment. It would take a long time, and I am not sure I would actually receive your devices at all.

Also, please do not think I want you to run a full test on the devkit. You told us there is a flash problem, so my actual question is whether you can flash that module on the devkit.

This is not about any kind of thermal or eMMC check. I just want to know whether you can flash the module on the devkit and whether the behavior is repeatable.

For example, move the module from your custom board to the devkit, flash it, and see whether the flash succeeds.
Then we can choose the next debug step according to the result.

What’s the next debug step? Let’s skip straight to that step please.

If you mean that you want to skip the test on the devkit, then no. That one cannot be skipped.

If the module cannot be flashed even on the devkit, then provide me the serial number and the exact BSP you are using to flash.

If the module can be flashed on the devkit, then we need to check the serial console log from UART, captured while flashing fails on your custom board.

Ok, I will try to find one of our devkits next week and test this out. In general, we don’t use devkits anymore because we have built our own custom product. Do most of your customers use the devkit?

Is there anything we can do to evaluate the other 3 modules that are working? Is there a tool to check the health of the eMMC flash chip?
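
For reference, eMMC 5.0 and later devices report wear information in their EXT_CSD register (PRE_EOL_INFO and DEVICE_LIFE_TIME_EST_TYP_A/B), and recent mainline Linux kernels expose these values as the sysfs attributes life_time and pre_eol_info under the eMMC device directory. The following is only a minimal sketch of how those values could be decoded. It assumes the running kernel provides those attributes, which may not be true for the L4T kernel on a given Xavier BSP, and the device path used in the usage note is an assumption. The helper names are hypothetical and this is not an NVIDIA tool. These fields indicate wear only; they would not necessarily reveal corruption or damage caused by thermal cycling.

```python
from pathlib import Path

# PRE_EOL_INFO values as defined by the JEDEC eMMC 5.0+ specification.
PRE_EOL = {
    0x00: "not defined",
    0x01: "normal",
    0x02: "warning (80% of reserved blocks consumed)",
    0x03: "urgent (90% of reserved blocks consumed)",
}


def describe_life_time(value: int) -> str:
    """Decode a DEVICE_LIFE_TIME_EST_TYP_A/B value into a readable range."""
    if 0x01 <= value <= 0x0A:
        return f"{(value - 1) * 10}-{value * 10}% of estimated life used"
    if value == 0x0B:
        return "exceeded maximum estimated life"
    if value == 0x00:
        return "not defined"
    return "reserved"


def describe_pre_eol(value: int) -> str:
    """Decode a PRE_EOL_INFO value."""
    return PRE_EOL.get(value, "reserved")


def parse_hex_fields(text: str) -> list:
    """Parse whitespace-separated hex values such as '0x01 0x02'."""
    return [int(token, 16) for token in text.split()]


def read_emmc_health(device_dir: str) -> dict:
    """Read life_time and pre_eol_info from an eMMC sysfs device directory.

    device_dir is an assumption, e.g. /sys/block/mmcblk0/device.
    Raises FileNotFoundError if the kernel does not expose the attributes.
    """
    base = Path(device_dir)
    type_a, type_b = parse_hex_fields((base / "life_time").read_text())
    (pre_eol,) = parse_hex_fields((base / "pre_eol_info").read_text())
    return {
        "life_time_type_a": describe_life_time(type_a),
        "life_time_type_b": describe_life_time(type_b),
        "pre_eol_info": describe_pre_eol(pre_eol),
    }
```

On a unit where the attributes exist, calling read_emmc_health("/sys/block/mmcblk0/device") would return the three decoded fields. If the attributes are missing, the mmc-utils package (mmc extcsd read) is another way to view the same EXT_CSD fields, assuming it is available for the target.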

If you are testing a module and it works on your carrier board, that result should be definitive. However, if the test fails on your carrier board, there are a lot of reasons to repeat it on a dev kit. If the failure occurs on both the dev kit and your carrier board for all of those modules, then it probably isn’t an issue with the third-party carrier board. However, if it works on the dev kit, then some tuning would be needed on the third-party carrier board (e.g., something like drive strength might need a change for no reason other than longer or shorter traces).

Thanks, yes I will give the devkit a try on Tuesday. We have never had an issue like this with our custom boards, though, so I suspect that it’s a problem with the Xavier module.

Do you know why the eMMC was removed from the Orin Nano and Orin NX products? Was this due to reliability issues with the eMMC chip?

Basically, if you want to report an issue on this forum, try to reproduce it on the devkit first.

We are not able to debug things on your custom carrier board.