Jetson Nano DevKit Reboots when overheating

Hello,

I am working with a Jetson Nano Devkit model P3450.

I am running video processing scripts that load the components as heavily as possible. I run these both at normal ambient temperature and inside a heat chamber at up to 60C.

During the heat chamber test, I am seeing a strange issue: after about 30-60 minutes of operation, the Jetson reboots itself. According to /var/log/kern.log, it seems to be an overheating issue.

The strange part is that I am monitoring the AO/CPU/GPU temperatures when this happens. They hover at 95/90/89C respectively, which should not be hot enough to cause the Jetson to overheat and turn off. The Jetson is also not behaving as it should when it overheats: it is not shutting down, it is simply rebooting.

Are there other component temperatures that I should be monitoring?
Are there any other logs that I can check for a more detailed report on the reason for the restart?

Here are the entries that I see right before reboot (full log attached):

Jun 16 14:38:28 iza-123456 kernel: [ 1662.727799] tegra_soctherm 700e2000.soctherm: soctherm: trip temperature 2147483647 forced to 127000
Jun 16 14:38:29 iza-123456 kernel: [ 1663.751781] tegra_soctherm 700e2000.soctherm: soctherm: trip temperature 2147483647 forced to 127000
Jun 16 14:38:30 iza-123456 kernel: [ 1664.775772] tegra_soctherm 700e2000.soctherm: soctherm: trip temperature 2147483647 forced to 127000
Jun 16 14:38:31 iza-123456 kernel: [ 1665.799699] tegra_soctherm 700e2000.soctherm: soctherm: trip temperature 2147483647 forced to 127000
Jun 16 14:39:00 iza-123456 kernel: [ 0.000000] Booting Linux on physical CPU 0x0
Jun 16 14:39:00 iza-123456 kernel: [ 0.000000] Linux version 4.9.140-tegra (buildbrain@mobile-u64-2713) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05) ) #1 SMP PREEMPT Mon Dec 9 22:47:42 PST 2019
Jun 16 14:39:00 iza-123456 kernel: [ 0.000000] Boot CPU: AArch64 Processor [411fd071]
Jun 16 14:39:00 iza-123456 kernel: [ 0.000000] OF: fdt:memory scan node memory@80000000, reg size 48,
Jun 16 14:39:00 iza-123456 kernel: [ 0.000000] OF: fdt: - 80000000 , 7ee00000
Jun 16 14:39:00 iza-123456 kernel: [ 0.000000] OF: fdt: - 100000000 , 7f200000

IZA_kern.log (1.5 MB)
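
For anyone going through a large kern.log like the one attached, here is a minimal Python sketch that pulls out the last few lines logged before each boot. It assumes that every boot starts with the "Booting Linux on physical CPU" message seen in the excerpt above; the names `lines_before_boot` and `report` are made up for this illustration.

```python
from collections import deque

# Marker taken from the log excerpt; assumed to open every boot.
BOOT_MARKER = "Booting Linux on physical CPU"

def lines_before_boot(lines, context=5):
    """Yield the last `context` lines seen before each boot marker."""
    recent = deque(maxlen=context)
    for line in lines:
        line = line.rstrip("\n")
        if BOOT_MARKER in line:
            yield list(recent)
            recent.clear()
        else:
            recent.append(line)

def report(path, context=5):
    """Print the lines that precede every boot found in the log at `path`."""
    with open(path, errors="replace") as log:
        for number, block in enumerate(lines_before_boot(log, context), 1):
            print(f"--- before boot #{number} ---")
            print("\n".join(block))
```

Calling `report("/var/log/kern.log")` prints one block per boot. A block is empty when the log file itself begins with a boot message.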

hello yegor,

may I know whether you have designed a heatsink for your use case?
it seems you are hitting hardware throttling, and a software shutdown is also being triggered. please also check the Thermal Management section for more details.
thanks

Hello Jerry,
Thank you for your response!
We have tested with both a heatsink we designed and the basic one that comes with the DevKit. Both give the same results (the Jetson reboots as described above).
According to the Thermal Management guides, which we have read carefully, the Jetson is supposed to shut off if it overheats. In our case, the Jetson simply reboots. There is no delay between shutdown and system boot-up. Can you explain this behavior?

hello yegor,

assume a scenario where the temperature increases continuously.
at first, clock throttling reduces the clock rate once the software throttling thermal zone is reached; if the SoC is still overheating, it will eventually perform a hardware shutdown by asserting the reset pin on the PMIC, in order to protect the chip.
you may also use the tegrastats utility to monitor overall usage.
thanks
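
To list every temperature the kernel exposes, and not only the ones a monitoring script already picks out, the thermal zones can be read directly. This is a minimal sketch that assumes the standard Linux thermal sysfs layout (`/sys/class/thermal/thermal_zone*/type` and `temp`, with `temp` in millidegrees Celsius); zone names differ between boards and releases, and `read_thermal_zones` is a name made up for this illustration.

```python
from pathlib import Path

def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone type: temperature in C} for each thermal_zone* under base."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            name = (zone / "type").read_text().strip()
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone is unreadable or returned no valid value
        readings[name] = millidegrees / 1000.0
    return readings
```

Printing `read_thermal_zones()` once per second alongside the tegrastats output shows whether any zone runs hotter than the ones already being watched.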

Hello Jerry,

This is well understood, thank you. We are using a custom script that outputs tegrastats information throughout the test. As I mentioned, we DO NOT see the temperature reach the hardware shutdown thresholds. Here is a sample line:

RAM 3224/3964MB (lfb 101x4MB) SWAP 460/1982MB (cached 4MB) CPU [97%@1479,94%@1479,92%@1479,99%@1479] EMC_FREQ 0% GR3D_FREQ 99% PLL@88C CPU@94.5C PMIC@100C GPU@91C AO@101.5C thermal@93C POM_5V_IN 11985/10288 POM_5V_GPU 4389/3463 POM_5V_CPU 3726/3435

You mentioned that the Jetson performs a hardware shutdown, but we are seeing a reboot. Is there any other log that we can check to better understand the cause?
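
The temperature fields in a tegrastats line like the sample above can be extracted with a short regular expression. This is only a sketch: it assumes each temperature is written as `NAME@<number>C`, as in that sample, and `parse_temps` is a name made up for this illustration.

```python
import re

# Matches fields such as "CPU@94.5C"; the "97%@1479" CPU load entries
# do not match because "%" is not a word character.
TEMP_RE = re.compile(r"(\w+)@(-?\d+(?:\.\d+)?)C\b")

def parse_temps(line):
    """Return {sensor name: temperature in C} from one tegrastats line."""
    return {name: float(value) for name, value in TEMP_RE.findall(line)}

sample = ("RAM 3224/3964MB (lfb 101x4MB) SWAP 460/1982MB (cached 4MB) "
          "CPU [97%@1479,94%@1479,92%@1479,99%@1479] EMC_FREQ 0% GR3D_FREQ 99% "
          "PLL@88C CPU@94.5C PMIC@100C GPU@91C AO@101.5C thermal@93C "
          "POM_5V_IN 11985/10288 POM_5V_GPU 4389/3463 POM_5V_CPU 3726/3435")

temps = parse_temps(sample)
hottest = max(temps, key=temps.get)
print(hottest, temps[hottest])  # AO 101.5
```

Logging the hottest sensor per line makes it easier to see which reading peaks just before each reboot.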

hello yegor,

this seems to be a critical use case in which you are reaching the thermal throttling zones; you may also refer to Thermal Specifications.
please also set up a serial console. you can leave a terminal open to capture kernel messages as they occur, so that you catch the logs just before the system reboots.
thanks