Jetson TX2 crashing due to Ethernet connection

I am currently using a TX2 board, and planning on setting it up along with a framos adapter and a few cameras for a project. I have been running into an issue since I got this board though. While the board is connected to my company’s internet network, it crashes multiple times a day, seemingly at random, from 2 minutes after a previous crash, up to several hours later.

Through the use of minicom, I have been able to capture a snapshot of the crash as soon as it had happened, through a serial interface. The crashes stopped immediately as I removed the ethernet cable.
Crash.xcf (701.6 KB)

My setup is just a TX2 board, with a framos FPA-4.A adapter, along with a framos fsm-imx530 sensor module, and finally with a Schneider Kreuznach lens. Usually connected to it are an HDMI Display, a USB hub for a keyboard and mouse, an ethernet connection, and the power adapter.

The board was previously flashed manually with an older version of Jetpack, but has recently been flashed to latest with the Nvidia SDK manager in hopes that the crashes would be fixed. It’s worth mentioning that the TX2i that we have been testing with does not experience the same issues.

Rather than a screenshot you should probably provide a full serial console log from boot and up to (and including) the actual crash. Serial console can keep a log on a host PC so this will not be harmed by the crash and will show not only the crash, but what leads up to the crash. See:
http://www.jetsonhacks.com/2017/03/24/serial-console-nvidia-jetson-tx2/
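If it helps to line up the serial log with wall-clock time on the host, here is a minimal sketch that prefixes each captured line with a host-side timestamp. It only assumes a binary file-like source; the function name `stamp_lines` and the sample bytes are made up for illustration, and nothing here replaces the logging that minicom already does.

```python
import io
import sys
import time
from typing import BinaryIO, Callable, TextIO


def stamp_lines(src: BinaryIO, dst: TextIO,
                clock: Callable[[], float] = time.time) -> int:
    """Copy lines from src to dst, prefixing each with the host's local time.

    Returns the number of lines copied. Reading stops at end of stream.
    """
    count = 0
    for raw in iter(src.readline, b""):
        # Serial consoles can emit non-UTF-8 bytes during a crash; keep going.
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        dst.write(f"[{stamp}] {text}\n")
        dst.flush()  # flush every line so nothing is lost if the host is interrupted
        count += 1
    return count


# Demonstration with made-up bytes standing in for serial console output.
sample = io.BytesIO(b"first line\r\nsecond line\r\n")
stamp_lines(sample, sys.stdout)
```

On a real host the source would be the serial device opened in binary mode after the port has been configured; the device path depends on the adapter and is not something this sketch can know.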

You might also add what the output of “lsmod” is prior to the crash.

Sorry for the delay, covid issues arose and I couldn’t be in the office to test.

As I was putting together this reply, a second crash happened, so the second part of the attached log should not contain any scripts run by me.

minicom.cap (221.3 KB)

Output of lsmod:

Module                      Size  Used by
bnep                       18950  2
xt_conntrack                3979  1
ipt_MASQUERADE              2570  1
nf_nat_masquerade_ipv4      3993  1 ipt_MASQUERADE
nf_conntrack_netlink       33032  0
nfnetlink                   9716  2 nf_conntrack_netlink
xt_addrtype                 3915  2
iptable_filter              3008  1
iptable_nat                 3423  1
nf_conntrack_ipv4          14158  2
nf_defrag_ipv4              2129  1 nf_conntrack_ipv4
nf_nat_ipv4                 8176  1 iptable_nat
nf_nat                     25020  2 nf_nat_masquerade_ipv4,nf_nat_ipv4
nf_conntrack              131705  6 nf_conntrack_ipv4,nf_conntrack_netlink,nf_nat_masquerade_ipv4,xt_conntrack,nf_nat_ipv4,nf_nat
br_netfilter               17460  0
zram                       29369  4
overlay                    52649  0
bcmdhd                    979535  0
cfg80211                  697380  1 bcmdhd
spidev                     14571  0
userspace_alert             6697  0
nvgpu                    1720761 20
bluedroid_pm               16123  0
ip_tables                  21475  2 iptable_filter,iptable_nat
x_tables                   38016  5 ip_tables,iptable_filter,ipt_MASQUERADE,xt_addrtype,xt_conntrack

Hi,

Is the crash always preceded by these i2c errors? If you check the error log you just pasted, you will see lots of i2c errors there.

The i2c errors are unrelated to the crash, those are only there due to the EEPROM read/writes that I am doing as I am troubleshooting another issue, specifically about my project.

I think we may need you to run more tests to find out which scenario triggers this error.

First, please provide the following information:

  1. Is this a TX2 devkit? Or custom board?

  2. Which jetpack release are you using?

  3. Is it possible to test the TX2 in an environment that is not your company’s network?

  4. Does every TX2 you have suffer from this error?

This is indeed a TX2 devkit. On the camera connector, there is a framos FPA-4.A module, along with a framos IMX530 Image Sensor Module, and then followed by the lens we use. Even after removing those, the crashes continue, and they are necessary for our project.

I am using Jetpack 4.5.1, the latest one through the Nvidia SDK Manager.

I have tried leaving the board overnight in two scenarios: with no Ethernet or WiFi connection, and with an Ethernet connection to my work computer (the same one used to flash it). It did not crash in either case.

No. I have not tested with any other TX2s personally, but my coworker with a TX2i is not running into any crashes.

I think you have not tried a switch or hub in between, right? It looks like you only tested a direct connection to your work computer.

Regarding “No. I have not tested with any other TX2’s personally”:

If you have another TX2, please try it as well. Thanks.

It is connected to a network switch. We needed more Ethernet ports than just the one on the wall. Do you need information about the switch as well?

I wanted to first see if it could be solved on its own, before using another board. Due to other issues as well, I think it will be necessary anyway. Can you at least answer the first question here before I do that?

I think the most important thing I want to know is whether this issue happens on other TX2 boards or not.

If you search this forum for your error log, you will find almost no cases like yours.

Thus, if every other TX2 also has this issue in your company’s Ethernet environment, then we may need to dump packets/traffic or add some prints to the kernel and have you debug it, since you are the only one who can reproduce this issue.

I will attempt this now, but I will likely only be able to report back in a few hours or tomorrow.

I am curious: since this appears to be related to Ethernet buffer issues, and because the MAC address might be tied to the EEPROM, can you see the MAC address in “ifconfig” before a crash occurs?


Some notes for what linuxdev is talking about.

The MAC address for the native Ethernet interface is read from the EEPROM. If the board is somehow unable to read the EEPROM over i2c, the MAC address will be missing and the driver will just assign a random one.

As I was writing this message, I had just observed the TX2i also crash.

I was unable to get serial console output throughout the night on my coworker’s TX2i, but what seems worrying to me is the output of last reboot, which is:

thanh@thanh-desktop:~$ last reboot
reboot   system boot  4.9.201-tegra    Tue Aug  3 10:14   still running
reboot   system boot  4.9.201-tegra    Tue Aug  3 10:05   still running
reboot   system boot  4.9.201-tegra    Tue Aug  3 09:59   still running
reboot   system boot  4.9.201-tegra    Tue Aug  3 09:23   still running
reboot   system boot  4.9.201-tegra    Tue Aug  3 00:06   still running
reboot   system boot  4.9.201-tegra    Mon Aug  2 17:04   still running

wtmp begins Mon Aug  2 17:04:05 2021
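To put numbers on how irregular the reboots are, here is a small sketch that turns the `last reboot` output above into the time between consecutive boots. The function name `reboot_intervals` is made up for illustration. `last` prints no year on these lines, so the year is passed in (2021, taken from the `wtmp begins` line), and month names are assumed to be in English.

```python
from datetime import datetime
from typing import List


def reboot_intervals(last_output: str, year: int) -> List[float]:
    """Return the minutes between consecutive boots, oldest interval first."""
    boots = []
    for line in last_output.splitlines():
        fields = line.split()
        # Expected shape: reboot system boot <kernel> <weekday> <month> <day> <HH:MM> ...
        if len(fields) < 8 or fields[0] != "reboot":
            continue
        month, day, hhmm = fields[5], fields[6], fields[7]
        boots.append(datetime.strptime(f"{year} {month} {day} {hhmm}",
                                       "%Y %b %d %H:%M"))
    boots.sort()
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(boots, boots[1:])]


# The output posted above, copied verbatim.
LAST_OUTPUT = """\
reboot   system boot  4.9.201-tegra    Tue Aug  3 10:14   still running
reboot   system boot  4.9.201-tegra    Tue Aug  3 10:05   still running
reboot   system boot  4.9.201-tegra    Tue Aug  3 09:59   still running
reboot   system boot  4.9.201-tegra    Tue Aug  3 09:23   still running
reboot   system boot  4.9.201-tegra    Tue Aug  3 00:06   still running
reboot   system boot  4.9.201-tegra    Mon Aug  2 17:04   still running
"""

print(reboot_intervals(LAST_OUTPUT, 2021))
# prints [422.0, 557.0, 36.0, 6.0, 9.0]
```

So the gaps range from about 6 minutes to over 9 hours, which matches the "seemingly at random" pattern described for the TX2. Note that not every entry is necessarily a crash; a manual reboot would show up the same way.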

I will be monitoring serial output and update whenever possible.

ifconfig output:

thanh@thanh-desktop:~$ ifconfig
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:26:d9:e7:f3  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1280
        inet 172.16.9.128  netmask 255.255.252.0  broadcast 172.16.11.255
        inet6 fe80::b545:700:ad56:fdc0  prefixlen 64  scopeid 0x20<link>
        ether 00:04:4b:f8:47:9c  txqueuelen 1000  (Ethernet)
        RX packets 22573  bytes 21125537 (21.1 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11597  bytes 1297350 (1.2 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 41  

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1  (Local Loopback)
        RX packets 404  bytes 31314 (31.3 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 404  bytes 31314 (31.3 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

rndis0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether b6:3e:24:7c:4a:d1  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

usb0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether b6:3e:24:7c:4a:d3  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlan0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 00:04:4b:f8:47:9a  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
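As a quick check on the MAC address question raised earlier, here is a rough sketch that tests whether an address looks randomly generated. It relies on two assumptions: that a random fallback address would have the locally administered bit set (0x02 in the first octet, which is what the Linux kernel's generic random-address helper does, though I have not confirmed the Tegra Ethernet driver uses it), and that the `00:04:4b` prefix shared by eth0 and wlan0 in the output above is the vendor-assigned one. The function names are made up for illustration.

```python
def parse_mac(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into its six raw bytes."""
    parts = mac.strip().split(":")
    if len(parts) != 6:
        raise ValueError(f"not a 6-octet MAC address: {mac!r}")
    return bytes(int(part, 16) for part in parts)


def is_locally_administered(mac: str) -> bool:
    """True if the locally administered bit (0x02 of the first octet) is set."""
    return bool(parse_mac(mac)[0] & 0x02)


def has_prefix(mac: str, prefix: str = "00:04:4b") -> bool:
    """True if the address starts with the given 3-octet prefix (assumed vendor prefix)."""
    wanted = bytes(int(part, 16) for part in prefix.split(":"))
    return parse_mac(mac)[:len(wanted)] == wanted


# Addresses copied from the ifconfig output above.
addresses = {
    "docker0": "02:42:26:d9:e7:f3",
    "eth0": "00:04:4b:f8:47:9c",
    "rndis0": "b6:3e:24:7c:4a:d1",
    "wlan0": "00:04:4b:f8:47:9a",
}

for name, mac in addresses.items():
    print(name, mac,
          "locally administered" if is_locally_administered(mac) else "globally unique",
          "(vendor prefix)" if has_prefix(mac) else "")
```

By this check eth0 does not look like a random fallback address, which would suggest the EEPROM read succeeded on this boot. That is only a hint, not proof.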

tx2i.cap (41.5 KB)
Attached is the TX2i serial log. Good timing.

Was any application using Ethernet when the error happened? For example, streaming.

Or can the issue be hit just by leaving the board idle?

Not that I am aware of. The board was idle when it crashed, and throughout the entire night as well. I had only used Chrome briefly to look up the command that lists the last few reboots.

I am a little bit confused by the test you tried yesterday. You said “it is connected to a switch”. Do you mean a local network that only has the switch and is not connected to the Ethernet port in your office?

host <-> switch <-> TX2

Our current setup is:
internal network <-> switch <-> TX2, work laptop, another coworker’s computer

Removing the switch from the equation would be difficult for me and the others.

That “internal network” means your office network environment, right?

Can you take your TX2 to another environment, such as your home, and use a switch there to see if this issue also happens? My guess is that it will not. I just want to confirm that this is really related to the office network environment.