Sudden Shutdown - Temperature too high?

Hello,

We are using a Jetson Xavier that runs under heavy GPU load and some CPU load.
After a few hours, it suddenly switches off.

Below is the tegrastats output up to the shutdown (5-second interval):

RAM 11306/15700MB (lfb 9x1MB) CPU [98%@2265,26%@2265,36%@2265,30%@2265,27%@2265,26%@2265,25%@2265,34%@2265] EMC_FREQ 62%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 2% bg 0% AO@66C GPU@71.5C Tboard@62C Tdiode@65.75C AUX@63C CPU@69.5C thermal@67.5C PMIC@100C GPU 19807/17759 CPU 6106/4963 SOC 5510/5251 CV 0/0 VDDRQ 2977/2877 SYS5V 4654/4576
RAM 11306/15700MB (lfb 8x1MB) CPU [98%@2188,34%@2188,32%@2220,27%@2020,29%@2062,30%@2265,28%@2265,25%@2265] EMC_FREQ 59%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 1% bg 0% AO@66C GPU@71.5C Tboard@62C Tdiode@65.75C AUX@63C CPU@69.5C thermal@67.2C PMIC@100C GPU 18341/17760 CPU 5813/4964 SOC 5216/5251 CV 0/0 VDDRQ 2830/2877 SYS5V 4574/4576
RAM 11307/15700MB (lfb 7x1MB) CPU [97%@2265,37%@2265,27%@2265,30%@2265,28%@2221,29%@2188,27%@2188,26%@2145] EMC_FREQ 60%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 0% bg 0% AO@66C GPU@71.5C Tboard@62C Tdiode@65.75C AUX@63C CPU@69C thermal@67.35C PMIC@100C GPU 19219/17760 CPU 5810/4964 SOC 5363/5251 CV 0/0 VDDRQ 2979/2877 SYS5V 4614/4576
RAM 11307/15700MB (lfb 7x1MB) CPU [97%@2265,30%@2265,31%@2265,32%@2265,26%@2265,26%@2265,31%@2265,28%@2265] EMC_FREQ 60%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 2% bg 0% AO@66C GPU@72C Tboard@62C Tdiode@65.75C AUX@63C CPU@70C thermal@67.5C PMIC@100C GPU 19219/17761 CPU 6108/4964 SOC 5361/5251 CV 0/0 VDDRQ 2978/2877 SYS5V 4654/4576
RAM 11308/15700MB (lfb 7x1MB) CPU [97%@2265,29%@2265,30%@2265,33%@2265,26%@2265,30%@2265,27%@2265,26%@2265] EMC_FREQ 58%@2133 GR3D_FREQ 90%@1377 APE 150 MTS fg 2% bg 0% AO@66C GPU@71.5C Tboard@62C Tdiode@65.5C AUX@63C CPU@69.5C thermal@67.35C PMIC@100C GPU 18043/17761 CPU 5813/4965 SOC 5216/5251 CV 0/0 VDDRQ 2832/2877 SYS5V 4574/4576
RAM 11307/15700MB (lfb 7x1MB) CPU [97%@1410,29%@1881,30%@2035,34%@2244,29%@2188,26%@2265,26%@2265,30%@2265] EMC_FREQ 62%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 1% bg 0% AO@66C GPU@72C Tboard@62C Tdiode@65.75C AUX@63C CPU@69C thermal@67.65C PMIC@100C GPU 19956/17761 CPU 5810/4965 SOC 5510/5251 CV 0/0 VDDRQ 2977/2877 SYS5V 4654/4576
RAM 11307/15700MB (lfb 7x1MB) CPU [96%@2265,31%@2265,31%@2265,28%@2265,29%@2265,31%@2265,27%@2026,27%@2079] EMC_FREQ 62%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 2% bg 0% AO@66C GPU@72C Tboard@62C Tdiode@65.75C AUX@63C CPU@69.5C thermal@67.65C PMIC@100C GPU 19517/17762 CPU 5957/4965 SOC 5510/5251 CV 0/0 VDDRQ 2978/2877 SYS5V 4654/4576
RAM 11307/15700MB (lfb 7x1MB) CPU [97%@2265,31%@2265,32%@2265,29%@2265,28%@2265,29%@2265,28%@2265,28%@2265] EMC_FREQ 58%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 1% bg 0% AO@66C GPU@71.5C Tboard@62C Tdiode@65.75C AUX@63C CPU@69C thermal@67.35C PMIC@100C GPU 17894/17762 CPU 5664/4966 SOC 5216/5251 CV 0/0 VDDRQ 2832/2877 SYS5V 4574/4576
RAM 11304/15700MB (lfb 8x1MB) CPU [97%@2265,34%@2265,31%@2265,29%@2265,26%@2265,27%@2265,28%@2265,30%@2254] EMC_FREQ 60%@2133 GR3D_FREQ 87%@1377 APE 150 MTS fg 3% bg 0% AO@66C GPU@72C Tboard@62C Tdiode@65.75C AUX@63C CPU@70C thermal@67.65C PMIC@100C GPU 18921/17763 CPU 5810/4966 SOC 5363/5251 CV 0/0 VDDRQ 2978/2877 SYS5V 4654/4577
RAM 11305/15700MB (lfb 8x1MB) CPU [98%@2265,31%@2265,33%@2265,29%@2265,33%@2265,27%@2265,26%@2265,27%@2265] EMC_FREQ 61%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 2% bg 0% AO@66C GPU@71.5C Tboard@62C Tdiode@65.75C AUX@63C CPU@70C thermal@67.65C PMIC@100C GPU 18632/17763 CPU 5959/4966 SOC 5363/5251 CV 0/0 VDDRQ 2830/2877 SYS5V 4574/4576
RAM 11305/15700MB (lfb 8x1MB) CPU [98%@2265,35%@2265,34%@2265,28%@2265,28%@2265,26%@2265,28%@2265,27%@2265] EMC_FREQ 60%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 2% bg 0% AO@66C GPU@71.5C Tboard@62C Tdiode@65.75C AUX@63C CPU@69.5C thermal@67.5C PMIC@100C GPU 18333/17763 CPU 5813/4966 SOC 5216/5251 CV 0/0 VDDRQ 2830/2877 SYS5V 4574/4576
RAM 11306/15700MB (lfb 8x1MB) CPU [98%@2265,30%@2265,36%@2265,28%@2265,27%@1981,28%@1876,27%@2265,30%@2188] EMC_FREQ 62%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 2% bg 0% AO@66C GPU@72C Tboard@62C Tdiode@65.75C AUX@63C CPU@70C thermal@68C PMIC@100C GPU 19807/17764 CPU 6106/4967 SOC 5510/5251 CV 0/0 VDDRQ 2977/2877 SYS5V 4654/4577
RAM 11305/15700MB (lfb 8x1MB) CPU [98%@2265,30%@2265,31%@2265,27%@2265,31%@2265,29%@2265,27%@2265,29%@2265] EMC_FREQ 60%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 2% bg 0% AO@66C GPU@72C Tboard@62C Tdiode@65.75C AUX@63C CPU@69.5C thermal@67.5C PMIC@100C GPU 18930/17764 CPU 5661/4967 SOC 5214/5251 CV 0/0 VDDRQ 2830/2877 SYS5V 4574/4577
RAM 11307/15700MB (lfb 8x1MB) CPU [99%@2116,31%@1420,34%@2265,31%@2265,26%@2265,27%@2265,31%@2265,30%@2265] EMC_FREQ 61%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 2% bg 0% AO@66.5C GPU@72.5C Tboard@62C Tdiode@66C AUX@63C CPU@69C thermal@67.95C PMIC@100C GPU 19219/17765 CPU 5810/4967 SOC 5363/5251 CV 0/0 VDDRQ 2978/2877 SYS5V 4614/4577
RAM 11308/15700MB (lfb 8x1MB) CPU [99%@2265,27%@2265,38%@2259,34%@2260,27%@2265,30%@2265,27%@2265,26%@2265] EMC_FREQ 60%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 1% bg 0% AO@66C GPU@72.5C Tboard@62C Tdiode@66C AUX@63.5C CPU@70C thermal@67.85C PMIC@100C GPU 19368/17765 CPU 6106/4968 SOC 5361/5251 CV 0/0 VDDRQ 2977/2877 SYS5V 4654/4577
RAM 11308/15700MB (lfb 8x1MB) CPU [99%@2213,34%@2244,34%@2265,33%@2265,26%@2153,26%@1728,27%@2040,26%@2166] EMC_FREQ 60%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 2% bg 0% AO@66.5C GPU@72.5C Tboard@62C Tdiode@66C AUX@63.5C CPU@69.5C thermal@67.85C PMIC@100C GPU 19368/17766 CPU 5808/4968 SOC 5361/5251 CV 0/0 VDDRQ 2978/2877 SYS5V 4614/4577
RAM 11308/15700MB (lfb 8x1MB) CPU [99%@2265,29%@2265,33%@2265,35%@2265,28%@2265,24%@2265,26%@2265,28%@2265] EMC_FREQ 61%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 2% bg 0% AO@66C GPU@72C Tboard@62C Tdiode@66C AUX@63.5C CPU@70C thermal@68C PMIC@100C GPU 19658/17766 CPU 6106/4968 SOC 5510/5251 CV 0/0 VDDRQ 2977/2878 SYS5V 4654/4577

Is it being switched off because one sensor reports a high temperature?
I checked the thermal design guide and the temperatures seem to be fine, but I could not find a maximum value for Tdiode.
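
To compare the readings against whatever limits the thermal design guide lists, it can help to pull the sensor values out of each tegrastats line. The sketch below assumes the line format shown in the log above; `parse_temps`, `hottest`, and the regular expression are illustrative names, not part of any NVIDIA tool, and the sample line is a shortened copy of the last log entry.

```python
import re

# Matches sensor readings such as "GPU@71.5C" or "Tdiode@65.75C".
# Load fields like "99%@2265" are skipped because "%" is not a word character.
TEMP_RE = re.compile(r"(\w+)@(\d+(?:\.\d+)?)C\b")

def parse_temps(line):
    """Return {sensor_name: degrees_celsius} for one tegrastats line."""
    return {name: float(value) for name, value in TEMP_RE.findall(line)}

def hottest(temps, ignore=()):
    """Return (sensor, value) for the highest reading, skipping names in `ignore`."""
    candidates = {k: v for k, v in temps.items() if k not in ignore}
    return max(candidates.items(), key=lambda item: item[1])

# Shortened copy of the last sample in the log (assumption: format is stable).
sample = (
    "CPU [99%@2265,29%@2265] EMC_FREQ 61%@2133 GR3D_FREQ 99%@1377 "
    "AO@66C GPU@72C Tboard@62C Tdiode@66C AUX@63.5C CPU@70C "
    "thermal@68C PMIC@100C"
)

temps = parse_temps(sample)
print(temps)
print(hottest(temps, ignore=("PMIC",)))
```

In the samples posted here, PMIC reads exactly 100C in every line while the other sensors drift by fractions of a degree, so that value may not be a live measurement; the log alone cannot confirm this.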

Thank you very much!

You could install “extra” cooling on your module (watercooler, more fans, run it in a refrigerator, or whatever) and test the overheating theory by seeing if it still turns off, and if so, what the temperature readings were when it did.
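
To see what the readings were at the moment of a shutdown, each tegrastats line has to reach storage before power is lost. The following is a minimal sketch, assuming `tegrastats` is on the PATH and accepts `--interval` in milliseconds; the function names and the log path are hypothetical.

```python
import os
import subprocess
import time

def log_lines(lines, path):
    """Append each line to `path` with a timestamp, forcing it to disk."""
    with open(path, "a") as f:
        for line in lines:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            f.write(f"{stamp} {line.rstrip()}\n")
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to write it to storage

def run_tegrastats_logger(path, interval_ms=5000):
    """Run tegrastats and log its output until it exits or power is lost."""
    proc = subprocess.Popen(
        ["tegrastats", "--interval", str(interval_ms)],
        stdout=subprocess.PIPE,
        text=True,
    )
    log_lines(proc.stdout, path)
```

Calling `run_tegrastats_logger("/home/user/tegrastats.log")` (path is only an example) before starting the workload should leave the final samples in the file after an unexpected power-off. The `fsync` on every line costs some I/O, which seems acceptable at a 5-second interval.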

Hi, is there any log information from when the shutdown happens?

Hi,

I checked the syslog and there is nothing unusual.

Did you try the suggestion in comment #2 to use external cooling? It’s hard to tell what the problem is since there is nothing unusual in the log. If necessary, you can request an RMA for it.

I’m serious: Put the module into your fridge (or even freezer) and run it from there, and see if it does better.

Hi. Has this problem been solved?
I have the same problem when running a high-temperature test (MaxN mode with external cooling); at room temperature it works fine.
When it shuts down, the ambient temperature is only 50℃, and the CPU/GPU/AUX temperatures are about 70℃, which is below the high-temperature limit.
What is the reason for this problem?

Hi,

We “solved” the problem by adding an extra fan to our robot. It has worked fine so far.

Unfortunately, we were not able to find the cause of the problem.

We found that it might be caused by an insufficient power supply.
As the temperature increases, the performance of the power supply decreases.
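
If the power supply is the suspect, the rail readings at the end of each tegrastats line give a rough picture of the draw. The sketch below assumes the `NAME current/average` fields are in milliwatts and that the rails do not overlap, so their sum only approximates the total load; `parse_rails` and `total_mw` are hypothetical helper names, and the sample is the tail of the last line in the log above.

```python
import re

# Matches rail fields such as "GPU 19658/17766" (current/average, assumed mW).
# "RAM 11308/15700MB" is skipped because the number is followed by "MB".
RAIL_RE = re.compile(r"\b([A-Z][A-Z0-9]*) (\d+)/(\d+)\b")

def parse_rails(line):
    """Return {rail: (current_mw, average_mw)} for one tegrastats line."""
    return {name: (int(cur), int(avg)) for name, cur, avg in RAIL_RE.findall(line)}

def total_mw(rails):
    """Sum the instantaneous readings of all rails (rough estimate only)."""
    return sum(current for current, _average in rails.values())

# Tail of the last sample in the log, with the RAM field kept to show it is ignored.
sample = (
    "RAM 11308/15700MB (lfb 8x1MB) thermal@68C PMIC@100C "
    "GPU 19658/17766 CPU 6106/4968 SOC 5510/5251 CV 0/0 "
    "VDDRQ 2977/2878 SYS5V 4654/4577"
)

rails = parse_rails(sample)
print(rails)
print(f"approx. total: {total_mw(rails) / 1000:.1f} W")
```

For the last sample in the log this comes to roughly 38.9 W, which could be compared against what the supply is rated to deliver at the elevated ambient temperature.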