Xavier Device Random Reboot Issue – Debugging Help Needed

We are experiencing occasional reboot issues with Xavier devices in a large-scale deployment. Below are the details:

  • Failure Frequency: In a batch of 56 devices, 7 reboots occurred over a period of 30 days.
  • Possible Trigger: We found that a rapid 0.5V drop in input voltage can sometimes cause a reboot, but this is not confirmed as the root cause.
  • Log Analysis: System logs contain some abnormal messages, but these messages also appear on devices that run without issues, so they don’t fully correlate with the reboot events.
  • Hardware Design: We only use Xavier’s core components, and all external peripherals (including the power module) are custom-designed.
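Since a rapid 0.5V input dip is the suspected trigger, it may help to correlate reboot times with on-module rail readings. A hedged sketch: many Jetson modules expose INA3221 power monitors through sysfs, but the exact paths vary by module and L4T release, so this only searches for candidate readouts rather than hard-coding a path:

```shell
# Look for INA3221 voltage/current readouts in sysfs (paths are board- and
# release-specific, so nothing is hard-coded here; on a non-Jetson host this
# simply finds nothing).
find /sys -path '*ina3221*' -name '*_input' 2>/dev/null | head -n 5
echo "rail scan complete"
```

Sampling these rails periodically alongside timestamps would at least show whether the dips line up with the reboot events.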

Additional Information:

  1. We conducted CPU and GPU load tests, but the issue could not be triggered.
  2. We logged the application’s worker ID changes during an overnight idle period, hoping the data could help with the analysis.
  3. Device info: Xavier AGX 32GB/JetPack 4.6.3

Questions:

  1. What are the potential causes of unexpected reboots in Xavier devices?
  2. Are there any recommended debugging steps to identify whether the reboot is caused by power issues, software crashes, or hardware failures?
  3. Is there a way to check if the PMIC or another subsystem is triggering the reboot?
  4. Can any software operations (not system commands, but regular applications) trigger a system reboot on Xavier devices?
  5. Could memory fragmentation possibly be causing this issue?

Any insights or recommendations would be greatly appreciated! I will attach system log screenshots for reference.

Hi,

You should enable the serial console log over UART and check what was printed before the reboot.

None of your questions can be answered with the current information. The log will tell us more.

You really do need a serial console log rather than a screenshot, but it sounds like that might be difficult. Do you have ssh access? If so, you could get a dmesg and redirect it to a log file. If the unit stops responding, the ssh method would require you to already have a login running “dmesg --follow”; the output could then be copied and pasted from the host (the “dmesg” output would already be on the host, so it wouldn’t matter that the login has stopped). You might need to get creative.
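A minimal sketch of that workflow (“user” and “jetson” are placeholder names; the real capture command is shown as a comment because it only makes sense against a live Jetson):

```shell
# Run from the host PC so the log survives even if the Jetson dies mid-write.
# The actual capture command would look like:
#
#   ssh user@jetson 'dmesg --follow' | tee -a xavier-dmesg.log
#
# The line below only demonstrates the tee append pattern locally:
echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') capture demo line" | tee -a xavier-dmesg.log
```

Because tee appends on the host side, the last kernel messages before a reboot are preserved even though the ssh session itself dies with the unit.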

I will ask something though: Do you have PCIe devices attached? Your screenshot shows PCIe bus errors, and AER seems to be “recovering”, but this becomes something of an infinite loop. Keep in mind that a PCIe bus error can be something transient or something from design issues or from hardware failure. Something as simple as being near industrial noise could cause issues.

Speaking of power delivery, Jetsons are somewhat sensitive to power stability. Perhaps so is the PCIe device? Or perhaps the PCIe device has a momentary increased need for power which loads down something on the power bus of the Jetson? I have no idea if this is power related, but such issues seem “reasonable” for this behavior.

Update on Xavier 32GB Device Random Reboot Issue

We have gathered additional information related to the unexpected reboots:

  1. During the reboot, the system log shows the following message:
tegra-pmc: ### PMC reset status reg: 0x2d
  2. We have set up debugging via the serial console and will provide more details once we gather the output.
    • Note: The attached dmesg output is not from the time of failure. It was collected after removing the board and rebooting. However, we are sharing it in case it provides any useful context.
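The raw status value can be broken into individual bits for comparison against the reset-source table in the Xavier Technical Reference Manual. The bit meanings are SoC-specific and are deliberately not guessed at here; this sketch only shows the mechanical decode of the logged value:

```shell
# Decode the raw PMC reset status value (0x2d, from the log above) into bits.
# Each bit's meaning must be looked up in the Xavier TRM; this only shows
# which bits are set.
val=0x2d
printf 'raw: 0x%02x\n' "$val"
for bit in 7 6 5 4 3 2 1 0; do
  printf 'bit %d: %d\n' "$bit" $(( (val >> bit) & 1 ))
done
```

For 0x2d this reports bits 0, 2, 3, and 5 set, which is what would be matched against the TRM table.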

Could you please help analyze these logs and provide any insights into potential causes of the issue?

Thanks in advance for your support!

I see lots of filesystem orphaned inodes, which means part of your filesystem is missing. This might of course be difficult to prevent when a system failure forces a sudden stop, but I want to point out that at this point you cannot necessarily trust the filesystem. Some of it was deleted.

I don’t know if those PMIC values have any meaning or not. The PMIC is the power management IC, which is what brings up various power rails in the right order during boot. Cold boot and warm boot both reset this.

I see lots of i2c errors. Something i2c is causing problems, and it is at this address:
0x31b0000

Note that there seems to be HDMI or DisplayPort configuring, so perhaps (not definitely) this is the i2c associated with the monitor query. The DDC wire uses 3.3V i2c; it is the Jetson which has to bring up the power to the i2c circuitry over the HDMI or DisplayPort cable to enable that i2c circuitry. Mostly this query occurs when hot plug detect shows a plug-in event, including power on and reboot, or actually removing and re-inserting the HDMI or DisplayPort cable. I’m guessing that the i2c issue is for the monitor hardware. Part of this is because of the timing of the messages, and another part is due to the 0x50 i2c address (the address is not definitive since there is more than one i2c bus). The data on this bus is known as the EDID (read via the DDC2 protocol).
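One way to check whether that EDID query actually succeeds is the kernel’s DRM sysfs interface, which exposes the raw EDID per connector. Connector names (HDMI-A-1, DP-1, …) vary per board, so this guards against missing files rather than assuming a name:

```shell
# Report the size of each connector's EDID as seen by the DRM layer. A 0-byte
# edid file on a connected monitor suggests the DDC/i2c query failed.
for edid in /sys/class/drm/card*-*/edid; do
  [ -e "$edid" ] || continue
  printf '%s: %s bytes\n' "$edid" "$(wc -c < "$edid")"
done
echo "edid scan done"
```

A healthy read is usually 128 or 256 bytes; repeatedly empty reads while the i2c errors appear would support the monitor-query theory.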

Please note that the integrated GPU (iGPU) of a Jetson, which is wired directly to the memory controller, does not allow many of the non-EDID configuration methods. Without the i2c query of the monitor, video configuration is going to fail. A PC would probably have other non-EDID modes it could try. Jetsons do have a fallback mode, but what seems odd to me is the apparent examination and rejection of many display modes. I’m wondering where those came from if the EDID was missing.

Note that you will have various Xorg logs in “/var/log/”. Depending on the DISPLAY environment variable, there will be a log for each DISPLAY. Most of the time it is a DISPLAY of “:0”, which means this log:
/var/log/Xorg.0.log

You can find the most recent log (assuming it has booted enough to allow access, e.g., over serial console):
ls -ltr /var/log/Xorg.*.log | tail -n 1

That log might say something about HDMI/DisplayPort/EDID, but unless the i2c is fixed this won’t matter.

When modifying a carrier board’s video you have to be careful to migrate any i2c function or hot plug detect function in addition to any actual HDMI or DisplayPort changes. I suspect that this is one failure point.

So far as PCIe goes, I did not see the error reporting. You mentioned:

 after removing the board and rebooting.

…however, I don’t know which board you removed. Was the meaning of this that a PCIe board was removed?

Update on Random Reboot Issue
I have checked all available logs but could not find any additional useful error messages. However, I noticed that after each reboot, the system log contains the following message:

tegra-pmc: ### PMC reset status reg: 0x2d

Based on my research, this indicates that the reboot was triggered by software. If a hardware power cycle occurs, the reset status register should consistently be 0x00.

To further investigate, I wrote a test program that can reproduce a system crash:

#include <iostream>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// WARNING: writing to arbitrary physical addresses through /dev/mem can hang
// or reboot the system. Run only as root, on a disposable test unit.
#define PHYS_ADDR 0x50000000  // Possibly related to GPU or IOMMU
#define MAP_SIZE 4096

int main() {
    // O_SYNC keeps accesses uncached so they reach the device immediately.
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) {
        std::cerr << "Failed to open /dev/mem" << std::endl;
        return -1;
    }

    // Map one page of physical address space into this process.
    void *mapped_mem = mmap(nullptr, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, PHYS_ADDR);
    if (mapped_mem == MAP_FAILED) {
        std::cerr << "Memory mapping failed" << std::endl;
        close(fd);
        return -1;
    }

    // volatile prevents the compiler from optimizing away the device access.
    volatile uint32_t *ptr = (volatile uint32_t *)mapped_mem;
    std::cout << "Reading memory: " << *ptr << std::endl;

    // This write is what reproduces the crash/reboot.
    *ptr = 0x12345678;

    munmap(mapped_mem, MAP_SIZE);
    close(fd);
    return 0;
}

This suggests that an application may be triggering the system crash. Given this, I have a few questions:

  1. Are there any known issues in JetPack 4.6.3 that might cause unexpected system crashes?
  2. Can improper user-space application code, similar to the example above, cause memory access violations that lead to a system reboot?
  3. My system involves a high volume of log writes to disk (simultaneously writing over 100 log and data files, with a total data volume of ~28GB per hour). Could excessive logging activity contribute to system instability?
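For scale, the stated logging volume of ~28 GB per hour works out to a sustained write rate of roughly 8 MB/s, which a healthy eMMC or NVMe setup can normally absorb, but which does add steady wear and I/O latency:

```shell
# Back-of-envelope: convert the stated ~28 GB/hour of log writes into a
# sustained MB/s figure (using 1 GB = 1024 MB).
gb_per_hour=28
awk -v g="$gb_per_hour" 'BEGIN { printf "%.2f MB/s sustained\n", g * 1024 / 3600 }'
```

That prints about 7.96 MB/s sustained, a useful number to compare against the storage device’s rated throughput and endurance.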

Looking forward to your insights. Thanks!

To clarify, when I mentioned “removing the board and rebooting”, I was referring to performing a hard power cut (sudden power loss) on the entire Xavier module, not removing a specific PCIe device.

Let me know if you need further details. Thanks!

Hi,

As I mentioned, you should check your serial console log and see what was printed before the reboot.

Checking the PMC log alone won’t help, because it won’t tell you the true error that led to that PMC status.

For example, if it is a software reset, then the problem becomes how to locate which software triggered it. In that case, continuing to study the PMC register or guessing from user-space code might not be sufficient.

The UART console log will print the kernel stack dump if this issue is triggered by a kernel panic, and that might give a hint.

The filesystem is being damaged by the hard, sudden loss of power. A proper shutdown is the only way to avoid that. Technically, you could mount the filesystem synchronously, at the cost of a dramatic performance decrease and an extreme loss of solid-state storage lifespan. Under the current conditions, though, I have to wonder: what is it you are trying to achieve? If you are trying to make a device tolerant of power loss, then you need some sort of battery or supercapacitor backup capable of a fast “normal” shutdown. If the loss of power was accidental and you are trying to recover, then you could clone the storage and try to piece things back together on a host PC using the clone.
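For reference, a synchronous mount is just a mount option. A sketch of what it might look like in /etc/fstab (the device node and mount point here are placeholders, not taken from your system, and this will be slow and hard on flash lifespan as described above):

```
# /etc/fstab -- hypothetical entry; device and mount point are placeholders.
# "sync" forces synchronous writes; "data=journal" makes ext4 journal file
# data as well as metadata. Both trade performance and flash wear for safety.
/dev/mmcblk0p1  /data  ext4  defaults,sync,data=journal  0  2
```

Even with this, a hard power cut mid-write can still lose the last transaction; it only bounds the damage, which is why a supercapacitor or battery backup triggering a clean shutdown remains the robust answer.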