Jetson Xavier NX - Determining cause of SHUTDOWN_REQ

System Details

Platform: Jetson Xavier NX Module
Firmware: JetPack 4.4.1
Carrier: Custom imaging platform

Issue

Our Jetson Xavier NX module has been shutting down under load. This occurs primarily while running inference, but we’ve also been able to reproduce the issue with CUDA samples such as matrixMul.

These shutdowns were triggered by SHUTDOWN_REQ, which, as I understand it, is typically caused by one of the following three scenarios:

  1. Software shutdown (e.g. sudo reboot)
  2. Thermal shutdown
  3. VDD_IN < 4.1V

When these shutdowns occur, our software hasn’t requested a software shutdown, temperatures are well below the thermal shutdown threshold, and VDD_IN hasn’t dropped below 4.1V.

The trace below shows VDD_IN fluctuating during inference, then dropping sharply when SHUTDOWN_REQ is asserted.


I attempted to check the pmic-reset-reason register; however, both of the following files were empty:

  • /proc/device-tree/chosen/reset/pmic-reset-reason/reason
  • /proc/device-tree/chosen/reset/pmic-reset-reason/register-value
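Because /proc/device-tree exposes each device-tree property as a file of raw bytes, one quick way to tell "missing" apart from "present but empty" is to read both properties directly. The sketch below assumes only the two paths listed above; the helper names are hypothetical, and showing non-string content as hex is a guess, since the encoding of register-value isn't documented in this thread.

```python
from pathlib import Path
from typing import Dict, Optional

# Paths from the post above; the node may differ between L4T releases.
RESET_DIR = Path("/proc/device-tree/chosen/reset/pmic-reset-reason")


def read_dt_property(path: Path) -> Optional[bytes]:
    """Return the raw bytes of a device-tree property, or None if it is absent."""
    try:
        return path.read_bytes()
    except FileNotFoundError:
        return None


def describe_property(raw: Optional[bytes]) -> str:
    """Classify a property as missing, empty, a printable string, or raw bytes."""
    if raw is None:
        return "missing"
    if len(raw) == 0:
        return "empty"
    # Device-tree string properties are NUL-terminated, so strip trailing NULs.
    text = raw.rstrip(b"\0")
    if text and all(32 <= b < 127 for b in text):
        return "string: " + text.decode("ascii")
    # Anything else (for example a big-endian u32 cell) is shown as hex.
    return "raw bytes: " + raw.hex()


def report_reset_reason(base: Path = RESET_DIR) -> Dict[str, str]:
    """Describe both pmic-reset-reason properties under `base`."""
    return {name: describe_property(read_dt_property(base / name))
            for name in ("reason", "register-value")}
```

Running `print(report_reset_reason())` on the module after a crash would distinguish a node that exists but was never filled in from one that is absent entirely.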

I couldn’t find a reset reason in the bootloader logs either:

  • rst_source : 0x0
  • rst_lvl : 0x0
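Serial-console captures are long, so a small parser can help pull these fields out reliably. This is a hedged sketch rather than an NVIDIA tool: it assumes the `rst_source : 0x0` / `rst_lvl : 0x0` spelling shown above and, as an additional assumption, that any reset reason printed elsewhere in the log appears as an upper-case token ending in RESET (such as TEGRA_POWER_ON_RESET). The function name is hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Matches "rst_source : 0x0" / "rst_lvl : 0x0" as printed on the serial console.
RST_FIELD_RE = re.compile(r"\b(rst_source|rst_lvl)\s*:\s*(0x[0-9a-fA-F]+)")
# Assumed shape of a reset-reason token, e.g. TEGRA_POWER_ON_RESET.
RESET_TOKEN_RE = re.compile(r"\bTEGRA_[A-Z0-9_]*RESET\b")


def parse_reset_log(lines: Iterable[str]) -> Tuple[Dict[str, int], List[str]]:
    """Return the last rst_source/rst_lvl values and every reset-reason token seen."""
    fields: Dict[str, int] = {}
    tokens: List[str] = []
    for line in lines:
        for name, value in RST_FIELD_RE.findall(line):
            # Later occurrences overwrite earlier ones (e.g. after a second boot).
            fields[name] = int(value, 16)
        tokens.extend(RESET_TOKEN_RE.findall(line))
    return fields, tokens
```

For example, `parse_reset_log(open("bootloader_post_crash_2.txt", errors="replace"))` would return the decoded register fields together with any reset-reason tokens found in the capture.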

How could I determine the cause of the SHUTDOWN_REQ?

Hi, do you have a log file for this? Have you checked the real-time status with the “tegrastats” command-line utility? Have you tried capturing the voltage drop on VDD_IN with an oscilloscope?

do you have log file of this?

I’ve attached the resulting syslog and tegrastats logs. The crash occurred at timestamp 19:35:45 in the syslog.
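To line the tegrastats log up with the crash, it can help to extract the VDD_IN power readings and look at the peak and at the last few samples before logging stopped. This sketch assumes each line carries a field of the form `VDD_IN <instant>/<average>` in milliwatts, as Xavier NX tegrastats output typically does; the layout can differ between JetPack releases, so the regular expression may need adjusting. The helper names are hypothetical.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumed field format: "VDD_IN <instantaneous>/<average>" in mW.
VDD_IN_RE = re.compile(r"\bVDD_IN (\d+)/(\d+)")


def vdd_in_samples(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (line_number, instantaneous_mW) for each line reporting VDD_IN."""
    samples = []
    for lineno, line in enumerate(lines, start=1):
        match = VDD_IN_RE.search(line)
        if match:
            samples.append((lineno, int(match.group(1))))
    return samples


def peak_vdd_in(lines: Iterable[str]) -> Optional[Tuple[int, int]]:
    """Return the sample with the highest instantaneous VDD_IN power, if any."""
    samples = vdd_in_samples(lines)
    return max(samples, key=lambda s: s[1]) if samples else None


def last_samples(lines: Iterable[str], count: int = 5) -> List[Tuple[int, int]]:
    """Return the final `count` samples, i.e. the readings logged just before the crash."""
    return vdd_in_samples(lines)[-count:]
```

Since each helper scans its input once, read the attachment into a list first (for example `lines = open("tegrastats_test2.txt").read().splitlines()`) so both can be called on the same data.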

Have you tried capturing voltage drop with oscilloscope on VDD_IN?

I’ve checked it with a logic analyzer. The original post includes a screenshot of the logic analyzer output; please advise if another format would be more useful.

I don’t currently have access to an oscilloscope, but I could purchase one if my logic analyzer is not providing the required output.

tegrastats_test2.txt (624 KB)
syslog_test2.txt (336.7 KB)

Hi,

If you want to check that log from software, please check the serial console log. Syslog cannot record this kind of log.

The method for doing that on the devkit is described on this page.

If you want to check that log from software, please check the log from serial console.

Are you referring to the reset reason? If so, I used the serial console to get the rst_source and rst_lvl logs provided in my original post (see below). Do these provide any insight?


I did find a reference to a reset reason in the Jetson Linux Driver Package Software Features documentation; however, it doesn’t appear in my own bootloader logs.

I’ve attached my bootloader logs for reference. These were recorded a few months ago when the device rebooted after a crash.

bootloader_post_crash_2.txt (78.2 KB)

Hi, the reset reason is TEGRA_POWER_ON_RESET, which looks like a normal reboot. This looks more like a power supply issue. Have you checked the power supply’s capability, and whether any voltage drop is observed on the supply? You can use an oscilloscope to capture any such voltage drop, or try another power supply with a higher supply capability.

Hi Trumany,

Thank you for following up. Please see my two part response below.

Power supply investigation

When this problem first occurred, I also suspected the power supply; however, I was able to rule it out through the following troubleshooting process:

  • First, I switched to a 12V 84A power supply and tested again, but the crash still occurred.
  • During the crash, I measured the output of our TPS54561 regulator which directly sources our SoM power sequencer (shown below). It remained at ~4.9V despite the SoM shutting down.
  • I then checked PWR_EN and found that it was 0V. This indicated that power was being disconnected from the SoM, even though the power supply remained steady.
  • Only two inputs remained that could cause the system to shut down: SHUTDOWN_REQ and PWR_GOOD. PWR_GOOD is normally sourced from the TPS54561 regulator; however, the resistor (R68) is unpopulated, so it has no effect. This indicated that SHUTDOWN_REQ must have caused the shutdown.
  • I checked SHUTDOWN_REQ and confirmed that it was being pulled low during the crash.
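The elimination above hinges on which signal falls first. If the logic-analyzer capture can be exported as per-channel (time, level) samples (an assumed format; export options vary by tool), the ordering can be checked numerically rather than by eye. This is a minimal sketch with hypothetical function names.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, int]  # (time in seconds, logic level 0 or 1)


def first_falling_edge(samples: Sequence[Sample]) -> Optional[float]:
    """Return the time of the first high-to-low transition, or None if there is none."""
    for (_, prev_level), (t, level) in zip(samples, samples[1:]):
        if prev_level == 1 and level == 0:
            return t
    return None


def shutdown_order(shutdown_req: Sequence[Sample], pwr_en: Sequence[Sample]) -> str:
    """Report which of SHUTDOWN_REQ and PWR_EN went low first, and by how much."""
    t_req = first_falling_edge(shutdown_req)
    t_en = first_falling_edge(pwr_en)
    if t_req is None or t_en is None:
        return "missing edge"
    if t_req <= t_en:
        return f"SHUTDOWN_REQ fell {(t_en - t_req) * 1e3:.3f} ms before PWR_EN"
    return f"PWR_EN fell {(t_req - t_en) * 1e3:.3f} ms before SHUTDOWN_REQ"
```

The resolution is limited to the analyzer's sample period, so the reported gap is only as precise as the capture.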

This investigation confirmed that the power supply was not failing; however, it was still possible that the Jetson PMIC was detecting unstable power and asserting SHUTDOWN_REQ as a precaution.

To improve stability under heavy load, I added 3 × 300 µF tantalum-polymer capacitors to our SoM input voltage rail. The SoM remained powered for a few additional milliseconds, but SHUTDOWN_REQ was still eventually asserted, causing a shutdown.


To summarize, the crash was not caused by a loss of power from the power supply, but by SHUTDOWN_REQ being pulled low and disabling the output of our SoM power sequencer. This is what prompted me to ask about the cause of the SHUTDOWN_REQ assertion.


Power down timing

the reset reason is TEGRA_POWER_ON_RESET sounds like a normal reboot.

If SHUTDOWN_REQ was pulled low, but TEGRA_POWER_ON_RESET indicates a normal reboot, does that imply that the TEGRA_POWER_ON_RESET value isn’t being updated correctly?

Figure 5-5 on page 16 of the Jetson Xavier NX Product Design Guide shows that VDD_IN must be allowed no less than 10ms to decay from 5V to 3V; however, as shown below, this decay takes less than 2ms in our system.

Jetson Xavier NX Product Design Guide:

Figure 5-5, page 16. Power Down (Sudden Power Loss)
[Image not reproduced]

Our system:

[Image: VDD_IN power-down capture from our system, not reproduced]

Perhaps the design isn’t allowing enough time for the shutdown procedure to complete, resulting in a failure to update the TEGRA_POWER_ON_RESET value. Do you think this could be the case?
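One way to put a number on the decay time from an exported capture, rather than reading it off a screenshot, is to find the last sample at or above 5V and the first sample below 3V. This is a hedged sketch: it assumes the capture can be exported as (time, volts) pairs, which may require an analog-capable analyzer or an oscilloscope, and `decay_time` / `meets_decay_requirement` are hypothetical helper names. The 5V/3V thresholds and the 10ms minimum are the values quoted above from Figure 5-5.

```python
from typing import Optional, Sequence, Tuple

AnalogSample = Tuple[float, float]  # (time in seconds, volts)


def decay_time(samples: Sequence[AnalogSample],
               v_start: float = 5.0, v_end: float = 3.0) -> Optional[float]:
    """Seconds from the last sample at or above v_start to the first sample below v_end.

    Resolution is limited to the capture's sample period. Returns None if the
    trace never drops below v_end, or never reached v_start before doing so.
    """
    t_start: Optional[float] = None
    for t, v in samples:
        if v >= v_start:
            t_start = t  # keep moving the start point through any ripple above v_start
        elif v < v_end:
            return None if t_start is None else t - t_start
    return None


def meets_decay_requirement(samples: Sequence[AnalogSample], min_decay_s: float = 0.010,
                            v_start: float = 5.0, v_end: float = 3.0) -> Optional[bool]:
    """Compare the measured decay against the minimum decay time (10 ms by default)."""
    measured = decay_time(samples, v_start, v_end)
    return None if measured is None else measured >= min_decay_s
```

If the rail idles slightly below 5V (the ~4.9V regulator output mentioned earlier, for instance), `v_start` has to be lowered to that level, otherwise the trace never reaches the start threshold and the function returns None.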

It could be the 2ms issue, since the requirement is >10ms, as you noted.

Is it possible to validate this on the devkit? If it can’t be reproduced on the devkit, we can narrow the issue down to the board design.

It functions correctly on the devkit. I will follow up with our hardware partners to discuss changes to the board design. Thank you very much for your help.
