Dead log possible?

Hi,

I would like to ask whether it is possible to capture a log that shows why the Orin died. (It dies approximately 4 hours after power-on, while left idle.)

Thank you

I assume you are saying it won’t boot. Have you tried a serial console boot log? If that also fails, then check the documents for your specific release and see if you can clone. The ability to clone certifies a basic part of the system still works (it doesn’t depend on what is flashed, but boot logging does), plus it gives you a copy of the rootfs (useful for examining its logs and for backup/restore).

Hi, NVIDIA

I am sending two log files as attached.
If the power is left on for 2-3 hours, a brick-out occurs, and we are trying to find the cause.

syslog.txt (9.6 MB)
kernLog.txt (591.3 KB)

Thank you

linuxdev is not talking about the log you shared.

He is telling you to use this method to capture the log from the UART port.

Hi, WayneWWW

I’m using a custom board (I designed it myself)

I do not have the Debug MCU (ATSAMD21G16B-AU_P3737) implemented in the circuit, so I will try to capture the log from the serial port on the UART3_TX/RX_DEBUG (H62, K60) pins.
This serial port normally connects to the Debug MCU, but I did not use the Debug MCU and instead brought the port out to a separate connector.
Can I capture the log through this port?

Thank you

Then you may need to use a USB-TTL cable instead.

Hi,

I’m using this cable.
So your opinion is that it is okay to capture logs with this cable?

Thank you very much

Yes. You need to enable logging and monitor what gets printed when the crash happens.
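Since the failure only shows up after several hours, it can help to timestamp every console line and flush it to disk immediately, so the last messages before the crash survive. The following is a minimal sketch, not an NVIDIA tool: the function name `timestamp_lines` is hypothetical, and it assumes the serial port has already been configured (baud rate, 8N1) by a terminal program or similar before the stream is opened.

```python
import time
from typing import BinaryIO, Callable, TextIO


def timestamp_lines(src: BinaryIO, dst: TextIO,
                    clock: Callable[[], float] = time.time) -> int:
    """Copy lines from src to dst, prefixing each with a local timestamp.

    Returns the number of lines written. dst is flushed after every line
    so the final messages before a hang or power loss are not lost in a
    buffer.
    """
    count = 0
    # readline() returns b"" at end of stream, which ends the loop.
    for raw in iter(src.readline, b""):
        # Console output can contain stray bytes; replace them rather
        # than letting a decode error abort the capture.
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        dst.write(f"[{stamp}] {text}\n")
        dst.flush()
        count += 1
    return count
```

As a usage idea, `src` could be the USB-TTL adapter opened in binary mode (a path such as /dev/ttyUSB0 is an assumption and depends on the host) and `dst` a log file opened for appending. The timestamp of the last line then shows roughly when the board stopped printing.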

I will say that although those are not the logs I was looking for, it does make me suspicious of some sort of network device, e.g., a cellular modem. Is there some sort of special network device involved? Is it USB? Getting a serial console boot log would really help, but meanwhile, can you describe any customization related to USB or network device design? I ask because of some unexpected kernel stack traces and USB errors which are not common.

Related to this, it looks like the kernel is stock and does not use any special configuration. Can you confirm any difference between the default kernel config (e.g., via “tegra_defconfig”) and what you use? Any extra modules? Any device tree changes?
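To get a rough picture of how often the USB errors and stack traces mentioned above occur in a saved log, the lines can be tallied by pattern. This is only a sketch: the pattern names and regular expressions are illustrative assumptions, not a list of messages known to appear in these particular logs, and they should be adjusted to what the logs actually contain.

```python
import re
from collections import Counter
from typing import Iterable

# Illustrative patterns only; tune them to the messages in your own logs.
PATTERNS = {
    "usb": re.compile(r"\busb\b", re.IGNORECASE),
    "call_trace": re.compile(r"call trace", re.IGNORECASE),
    "error": re.compile(r"\b(error|fail(ed|ure)?)\b", re.IGNORECASE),
}


def tally(lines: Iterable[str]) -> Counter:
    """Count how many lines match each pattern (one line may match several)."""
    counts: Counter = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def tally_file(path: str) -> Counter:
    """Run tally() over a log file on disk."""
    # errors="replace" keeps the scan going if the log has stray bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return tally(handle)
```

For example, `tally_file("kernLog.txt")` would summarize the kernel log attached earlier in this thread. Running it on logs from a board with and without the suspect device makes the difference easier to see.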

Hi,

I assembled a total of 3 custom Orin boards at the sample stage.
I am currently debugging these 3 boards and have not yet applied the DT file to test essential functions.
Four Intel I210 controllers are used to provide four Gigabit Ethernet ports per board.
The driver for these has not been applied yet, so a USB-to-Ethernet converter is used on one of the three assembled boards for temporary Ethernet access. Your comments suggest that this converter cable may be causing the problem.

Currently, the brick-out appears only on the board using this cable, after about 3 to 4 hours, so I assume the brick-out is consistent with your opinion.

  1. I will try a long-term test after removing this cable.
  2. The serial log is attached below

SerialLog (UART3).txt (95.5 KB)

The default config is being used without modification (35.4.1).
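One way to double-check that a running kernel really matches the default configuration is to compare two full config files symbol by symbol. The sketch below assumes both inputs are full `.config`-style files (for instance, one saved from the board and one generated from tegra_defconfig); how those files are obtained is outside this example, the function names are hypothetical, and a symbol missing from one side is treated as "n", which is a simplification.

```python
import re
from typing import Dict, Iterable, Tuple

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(lines: Iterable[str]) -> Dict[str, str]:
    """Map each CONFIG_ symbol to its value; unset symbols map to 'n'."""
    options: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
        # Any other line (blank, ordinary comment) is ignored.
    return options


def diff_configs(base: Dict[str, str],
                 other: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {symbol: (base_value, other_value)} for symbols that differ.

    A symbol absent from one side is reported as 'n' on that side.
    """
    diff: Dict[str, Tuple[str, str]] = {}
    for key in sorted(set(base) | set(other)):
        a, b = base.get(key, "n"), other.get(key, "n")
        if a != b:
            diff[key] = (a, b)
    return diff
```

An empty result from `diff_configs` means the two files agree on every symbol. Any entries that do appear are the places to look for extra modules or changed options.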

Thanks for your help.

You will probably need a valid device tree, but it is hard to say if that’s the actual problem. There are two kinds of firmware to consider:

  • Firmware on the Jetson to find and set up a match of drivers, hardware, and arguments or environment to pass along to the driver.
  • Firmware which loads into a device.

The latter form of firmware basically changes the device itself, e.g., its API and ABI might change when accessed. This is quite common on Wi-Fi or other wireless devices, because it allows one set of hardware to follow the different government-mandated wireless regulations around the world. One could create and ship hardware dedicated to each part of the world, or one can create firmware that makes the device comply with the rules of that part of the world. The latter is easiest and least costly. Using a driver with the wrong hardware (which is effectively what happens when missing or incorrect firmware leaves the device with the wrong ABI/API) will quite often cause an unrecoverable error that brings the system down.

The former kind of firmware, the device tree, goes to the operating system and pairs a driver with the details (such as the physical address and the setup of GPIO pins) needed for that driver to support the hardware. Once again, if the physical address is wrong, then there is no telling what the driver's commands to the device will do. Often this just leaves a device non-working, or perhaps just one feature won't work. It is possible, though, for this to bring a system down.

If USB works, then devices talking through USB typically don’t need a device tree setup. It is still possible though that a USB device which requires firmware to be loaded into it might still have problems if its firmware is not available, but it won’t be due to device tree (the uploading of firmware to the device needs to be triggered, but the device tree is not involved with upload to a USB device; this might be the realm of a udev trigger which works with USB hot plug).

My thought is that you need a correct device tree to find out. If the USB device uses firmware, then this also needs to be validated (not all USB network devices use firmware). Anything which is wireless, especially cell devices, greatly increases the odds of needing firmware to be loaded into the device itself.

Is there any way you can apply device trees and/or device firmware before testing? There are so many errors related to this that I don’t think there is much chance of debugging without it.

EDIT: Wired network devices rarely use any form of firmware upload to the device. Wireless devices do tend to use this.

Thank you for your detailed explanation of my problem.

I will apply the DT file first for future debugging purposes.
I will also test it without the USB-to-Ethernet cable connected.

Hi,

As you pointed out, after removing the USB-to-Ethernet cable, the board operates normally and the brick-out no longer happens.

I appreciate your help.
