Failing TK1 Custom board

I am hoping that someone might be able to help me diagnose some failures we have been seeing on our custom TK1 board.

We are using the TD575 variant. Our system is configured to allow high-temperature operation, with CPU temperatures held below 90 °C, but there can be extended periods where the CPU sits at roughly 80 to 85 °C.

We have had numerous failures that all have very similar signatures: the board appears to be operating well, then one day it simply hangs, and from then on it either fails to boot or freezes during boot.

We often see errors on the console like:

[0000.249] No Battery Present
[0000.252] Sdram initialization is successful
[0000.296] Instance[1] bootloader is corrupted trying for next Instance !
[0000.305] No Bootloader is found !
[0000.308] Error in NvTbootLoadBinary: 0x13 !
[0000.312] Error is 13

But sometimes the system will boot fully and then hang with no error message at all.

Often the behavior changes with temperature (if we heat or cool the board dramatically, we will get a different error).

We originally thought that this was a solder quality issue, or possibly a PCB fab issue, but after numerous dye-and-pry tests, cross-sections, C-SAM scans, continuity checks, etc., we no longer believe that the physical board is the cause of the failures. We also suspected faulty eMMC, but so far we have found no evidence of this either.

There are a few symptoms that I find interesting and I am wondering if anyone might be able to suggest what we should look into:

  1. I usually notice that when the board is frozen the CPU is still extremely hot (even after a long time). So, even though the system is frozen, we are burning considerable power. I have watched with a FLIR and can see that the CPU is continuing to heat up.

  2. On one system I was able to see that, prior to the failure event, the “SOC_THERM_TSENSOR_TEMP2_0 - MEM” temperature rose very rapidly. It reached 100 °C and then the system shut down. I have never been able to reproduce that behavior on any other board. After it failed, that same board now behaves as follows:

When cold booted, it fails to boot with the “No Bootloader Found” error. However, if I hard-reset the PMIC, the SoC starts getting very hot, specifically in the corner nearest to pin A1. The system still fails with the same error message.
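
A rapid rise like the one in item 2 is easier to catch if the sensor temperatures are logged continuously on boards that are still healthy. The following is a minimal sketch, assuming the kernel exposes the standard Linux thermal sysfs tree under /sys/class/thermal and reports values in millidegrees Celsius. The zone names will differ from the SOC_THERM register names, and the function names here are hypothetical.

```python
import glob
import os


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_type: temperature in degC} for every readable zone.

    Assumes the standard sysfs layout: thermal_zone*/type and
    thermal_zone*/temp, with temp reported in millidegrees Celsius.
    """
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as f:
                name = f.read().strip()
            with open(os.path.join(zone, "temp")) as f:
                raw = int(f.read().strip())
        except (OSError, ValueError):
            # Skip zones that are missing, unreadable, or not numeric.
            continue
        readings[name] = raw / 1000.0
    return readings


def max_rise_rate(samples):
    """Largest temperature slope in degC per second between neighbours.

    samples is a list of (time_s, temp_c) tuples in chronological order.
    """
    rates = [
        (c1 - c0) / (t1 - t0)
        for (t0, c0), (t1, c1) in zip(samples, samples[1:])
        if t1 > t0
    ]
    return max(rates) if rates else 0.0
```

Calling read_thermal_zones() about once per second and writing each reading to storage that survives a hang (or sending it over the serial console) would preserve the last samples before a failure. max_rise_rate() can then flag boards whose slope is unusually steep.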

At this point, I am not sure how to root-cause this failure. So far, I have not found any consistent way to re-create a failure on a good board. We have also been unable to pinpoint the actual failure location. Does anyone have any suggestions on techniques or tools we could use to help root-cause and resolve this?

Hi Gabriel16,

It would help us locate the root cause if you could answer the following questions:

  1. How many boards did you build in total, and how many of them show this problem?
  2. What is the detailed behavior of each failed board?
  3. What is your thermal solution? Is there a fan or not?
  4. How much current does the board draw when the issue happens? Did you check whether the CPU power rail is shorted to GND?
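
For questions 1 and 2, it may help to tally the failure signatures across boards from captured serial console logs. Below is a minimal sketch that matches the messages quoted in the original post. The label names, the function names, and the reading of the `[ssss.mmm]` prefix as seconds and milliseconds are assumptions made for illustration.

```python
import re
from collections import Counter

# Patterns taken from the console excerpt quoted in the post.
# The labels on the left are hypothetical names chosen for this sketch.
SIGNATURES = {
    "bootloader_corrupted": re.compile(r"bootloader is corrupted"),
    "no_bootloader": re.compile(r"No Bootloader is found"),
    "load_binary_error": re.compile(r"Error in NvTbootLoadBinary"),
}

# Assumed line format: "[ssss.mmm] message" (seconds, then milliseconds).
LINE_RE = re.compile(r"^\[(\d+)\.(\d{3})\]\s*(.*)$")


def classify_log(text):
    """Return (set of matched labels, last timestamp in ms or None)."""
    found = set()
    last_ms = None
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        last_ms = int(m.group(1)) * 1000 + int(m.group(2))
        for label, pattern in SIGNATURES.items():
            if pattern.search(m.group(3)):
                found.add(label)
    return found, last_ms


def tally(logs):
    """Count boards per signature combination.

    logs maps a board identifier to its captured console text.
    """
    counts = Counter()
    for text in logs.values():
        labels, _ = classify_log(text)
        counts["+".join(sorted(labels)) or "no_known_signature"] += 1
    return counts


SAMPLE = """\
[0000.249] No Battery Present
[0000.252] Sdram initialization is successful
[0000.296] Instance[1] bootloader is corrupted trying for next Instance !
[0000.305] No Bootloader is found !
[0000.308] Error in NvTbootLoadBinary: 0x13 !
[0000.312] Error is 13
"""

labels, last_ms = classify_log(SAMPLE)
```

Boards that hang silently after a full boot would land in the "no_known_signature" bucket. For those, the last timestamp gives a rough indication of how far the boot got.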