I think IGB refers to an Intel ethernet device. The watchdog timer part says an interrupt was issued for the device to service it, but the driver failed to respond (it was a kernel space issue while servicing the driver). It is hard to say anything more useful. You might consider adding a serial console boot log (what happens prior to this matters, since it sets up the environment the driver loads into), along with the following:
Can you verify this is a dev kit, versus a module plus third party carrier board?
Which JetPack or L4T release is this?
If this is an SD card model, which SD card image is used, and was the Jetson itself flashed with that release (there is QSPI memory used in boot which would affect the Intel ethernet)?
If you have ever changed the device tree or kernel, then the nature of the change would be good to know (and if stock, then that too would be good to know).
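To answer the release questions above directly from the Jetson, something like the following works. This is only a sketch, assuming an L4T rootfs (the fallbacks keep it harmless on other systems):

```shell
#!/bin/sh
# Sketch: gather release/model info for a support post (assumes an L4T
# rootfs; the fallbacks keep the script from erroring elsewhere).

# L4T release string, e.g. "# R32 (release), REVISION: 7.2, ..."
if [ -r /etc/nv_tegra_release ]; then
    head -n 1 /etc/nv_tegra_release
else
    echo "no /etc/nv_tegra_release (not an L4T rootfs?)"
fi

# Module/carrier model as reported by the device tree
if [ -r /proc/device-tree/model ]; then
    tr -d '\0' < /proc/device-tree/model; echo
else
    echo "no /proc/device-tree/model"
fi
```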
Thanks for your reply!
We are using a carrier board named “Boson for FRAMOS Carrier Board” with the stock kernel and device tree from JetPack 4.6.1.
It indeed contains a secondary i210 PCIe ethernet port, which was plugged in and in use before the crash occurred.
I will try to upgrade to 4.6.2 and install their BSP in case they made any modifications to the kernel regarding this issue and report back with more information if the issue still occurs.
If the third party carrier board has the same exact lane routing, then you won’t need a new device tree. However, if anything is different, then you will need a new device tree (which can affect the BSP). If the secondary i210 is related to a PCIe lane setup which is not an exact duplicate of the dev kit, then this too would cause a need for a device tree edit.
Hello,
So I flashed the latest stock JetPack 4.6.2 kernel, and the second ethernet interface was automatically detected.
I saw in dmesg the following error after a while, still related to the igb driver:
I am curious, if you run “ifconfig”, then at the bottom of the particular device’s listing there will be listed an “interrupt”. Which interrupt is it? I’ll use “37” for an example since I see this on one NX. Then run the following command and show what it outputs (adjust the 37 to be whatever IRQ it really is):
(there are some spaces in there, so use mouse copy and paste if you can)
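As a sketch of that kind of lookup (assuming the standard /proc/interrupts interface; 37 is only the example IRQ number from above):

```shell
# Show the header row (CPU names) plus the row for hardware IRQ 37.
# The IRQ's row lists per-CPU interrupt counts, the interrupt
# controller, and the driver that registered the IRQ.
grep -E "CPU0| 37:" /proc/interrupts
```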
The reason I ask this is that the kernel OOPS shown has an ethernet error, but it is a non-critical software IRQ, not a hardware IRQ. A hardware IRQ is triggered by actual wires to/from the ethernet card. A certain amount of code runs, and optionally, a driver can then segregate out driver code which does not directly talk to the ethernet card (triggering the software IRQ). It is good practice to keep the hardware IRQ handler minimal and defer other tasks to run after the hardware interrupt from the physical device has been released. This is especially true on a Jetson, since most hardware can only talk to CPU0 (a physical wire is required to talk to the device; for all CPU cores to be reachable one would need either a wire to each core or a programmable device to distribute interrupts to the other cores).

I am curious if the hardware IRQ itself shows anything unusual. I would expect the device tree to be related to the hardware IRQ drivers, but not to the software IRQ drivers. Not sure if this matters, but it is easy to look at.
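As a hedged illustration of the split described above: the hardware IRQ counts live in /proc/interrupts, while the deferred network work shows up as softirq counts in /proc/softirqs:

```shell
# Per-CPU softirq counters. NET_RX and NET_TX are the network softirqs
# under which a NIC driver's deferred (non-hardware) work runs; compare
# their spread across cores to the hardware IRQ in /proc/interrupts.
grep -E "CPU0|NET_RX|NET_TX" /proc/softirqs
```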
Note that the watchdog says the driver just did not respond. This could be from a fault in the software driver. However, it could still be a problem with the hardware driver, e.g., suppose the software driver stopped while waiting for data? If that were the case, then the software driver might be ok and simply starved due to a hardware driver issue.
Hello, thanks for your reply!
I don’t see any interrupt when I type ifconfig eth1 (which is the interface linked to the igb driver), only for eth0. Here is the output:
This is definitely the Intel ethernet driver failing (though it could be due to underlying hardware which the driver talks to) and timing out. I’ve heard of other Intel IGB driver issues, though I can’t recall what issues people had run into.
In the original kernel panic it is servicing the interrupt for that device, and the device does not respond. In the second case the OOPS fails with the “mavros_node” in user space instead of at the driver, but you can see that directly following the non-fatal OOPS that the IGB Intel NIC driver again is part of this, but it gets away with resetting the NIC, so I think it is again the NIC and/or NIC driver at issue.
The ifconfig output suggests that the NIC/driver combination is capable of working and shows many bytes both sent and received without any kind of error, and so it is hard to say what is going on. Does this always happen, or is there some condition which seems to trigger this? I’m wondering if for example this occurs under heavy load (which might indicate trouble delivering power to the NIC).
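If ifconfig’s summary is not detailed enough, the same counters can be read per interface from sysfs. A minimal sketch (eth1 is the interface from this thread; pass a different name as the first argument):

```shell
#!/bin/sh
# Sketch: dump error/drop counters for one interface from sysfs.
# "eth1" is the interface discussed in this thread; substitute yours.
IFACE="${1:-eth1}"
for stat in rx_errors tx_errors rx_dropped tx_dropped rx_fifo_errors; do
    f="/sys/class/net/$IFACE/statistics/$stat"
    if [ -r "$f" ]; then
        printf '%s: %s\n' "$stat" "$(cat "$f")"
    else
        printf '%s: (not available for %s)\n' "$stat" "$IFACE"
    fi
done
```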
An ethernet cable is plugged from eth1 into a sensor which sends ~3 MB/s of UDP data. There is very little transmission to the sensor (a few bytes per second).
Let us know if it works. If not, then I suggest flashing L4T R32.7.2 (which is the most recent; whatever IGB driver is there is likely matching the rest of the kernel and more debugged).
If I understand correctly, the “CPU: 1 PID 839” line (etc.) simply tells what the CPU was doing at the time of the panic. I wonder if the process name (which is always mavros_node) can be related to the panic?
This is with the latest L4T R32.7.2 with the latest Intel igb drivers.
I will try to continue the tests without this ‘mavros_node’ to see if we still have a backtrace. Thanks for your help!
We basically know that the CPU stopped responding and the watchdog timer triggered. We also know that the IGB driver is related. Quite possibly the IGB issue is from something the mavros_node did, but we don’t really have any details. For example, perhaps IGB has a bug under certain data conditions, and mavros_node sends that data; but perhaps IGB is perfectly fine, but somewhere mavros_node itself fails while sending to IGB, and indirectly times out IGB. We don’t know.
Someone from NVIDIA may be able to say more if they can reproduce this, but you’ll need to give information for reproducing. Give the exact make/model of Jetson, what release is flashed to it, what software has been added (e.g., for Mavros), and what hardware is added (e.g., network cards, even keyboard/mouse). Basically anything which might allow NVIDIA to cause this to occur for them while watching.
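A sketch of a script to collect some of that in one place (hedged: it only covers kernel-side basics, and you would still add the Mavros and hardware details by hand):

```shell
#!/bin/sh
# Sketch: collect the basics NVIDIA would need to try a reproduction.
echo "Kernel: $(uname -r)"
if [ -r /etc/nv_tegra_release ]; then
    echo "L4T: $(head -n 1 /etc/nv_tegra_release)"
fi
echo "Network interfaces: $(ls /sys/class/net 2>/dev/null | tr '\n' ' ')"
echo "Loaded igb-related modules:"
grep -E '^igb' /proc/modules 2>/dev/null || echo "  (igb not listed; may be built in)"
```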
Thanks for the explanation!
We have another device on the PCIe port, a wireless M.2 card. Could they draw too much power on the PCIe port, causing timeouts? Is there a way to check the current PCIe power draw?
Other devices, including PCIe, can cause timeouts and other failures in other hardware and/or drivers. I don’t see any indication of a power failure issue though. Certainly there are times when power draw does cause instability, but I don’t think this would normally cause this kind of timeout without some other logged error (it could, I just don’t think it is probable). The driver to the m.2 card is more likely to be an issue than is the power draw, but we also cannot confirm that.
The reality is that it would be best if NVIDIA had that exact m.2 PCIe card to install and test with. Perhaps they have one available, but you’d need to provide the exact model, and if you installed a driver, then give details of exactly where the driver is from (e.g., a driver downloaded separately, compiled, and added as a module, versus a driver compiled and installed from the NVIDIA-provided kernel source, and so on). Even if it is just an exact model whose driver is not available for others to debug, it is possible to find reports related to that driver on the Internet, which would be a good clue. Details for recreation are important.
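On the power question: I am not aware of a sysfs counter for actual PCIe current draw, but the negotiated link state is visible, and a link that has retrained to a lower speed or width than expected can hint at signaling or power delivery trouble. A hedged sketch:

```shell
#!/bin/sh
# Sketch: list each PCIe device's negotiated link speed and width.
# There is no direct current-draw counter here; a link retrained to a
# lower speed/width than expected is only an indirect hint of trouble.
echo "PCIe link status:"
for dev in /sys/bus/pci/devices/*; do
    [ -e "$dev" ] || continue
    speed=$(cat "$dev/current_link_speed" 2>/dev/null || echo "n/a")
    width=$(cat "$dev/current_link_width" 2>/dev/null || echo "n/a")
    echo "  $(basename "$dev"): speed=$speed width=$width"
done
```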