TX2 HDMI, Ethernet, and USB stop working after some time

Hi,

So I’m working with a TX2 devkit that stops working after about a week of running perfectly. What seems to happen is that the HDMI, Ethernet and USB ports stop working, and I can only access the board using the UART pins. Re-flashing using JetPack (4.2.2) fixes the problem, but it comes back after a while (seems like a week). The display just shows an Nvidia logo, the Ethernet lights aren’t blinking on the connector, and my USB keyboard doesn’t seem to provide any input. What could be the problem?

https://pastebin.com/YMvd9g9S
I’ve attached my dmesg output

Your firewall, UFW, is blocking some networking. This is probably unrelated, but worth looking at first since it is something which shouldn’t be there. The source of the network connection is “lo”, or localhost. The destination is never reached. The MAC address being refused:

00:00:00:00:00:00:00:00:00:00:00:00:86:dd

This looks a bit odd due to length, but perhaps this is something related to IPv6 which I’m not familiar with (IPv4 and IPv6 addresses look different, but I’ve not seen MAC addresses look different…the software reading MACs for IPv6 may actually format differently). On the other hand, perhaps this is a hardware issue (is this a development kit, or a custom board?). If you run “ifconfig”, do you see “00:00:00:00:86:dd” in any of the MAC addresses?
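A note on that string: as far as I know, the “MAC=” field in these kernel firewall log lines is not a single address but the raw Ethernet header, i.e. 6 bytes of destination MAC, 6 bytes of source MAC, and a 2-byte EtherType. On “lo” both MACs are all zeros, and 0x86dd is the EtherType for IPv6, which would account for the 14-byte length. A minimal sketch for splitting such a field (the function name `decode_ufw_mac` is just something made up for illustration):

```shell
# Split a netfilter "MAC=" log field into destination MAC, source MAC
# and EtherType. Assumes the usual 14-byte Ethernet header layout.
# The body runs in a subshell "( ... )" so the caller's IFS and
# positional parameters are left untouched.
decode_ufw_mac() (
    IFS=':'
    set -f            # no globbing while splitting
    set -- $1         # split the field on ':' into $1..$N
    if [ "$#" -ne 14 ]; then
        echo "expected 14 bytes, got $#" >&2
        exit 1
    fi
    dst="$1:$2:$3:$4:$5:$6"
    src="$7:$8:$9:${10}:${11}:${12}"
    etype="${13}${14}"
    case "$etype" in
        0800) proto="IPv4" ;;
        86dd|86DD) proto="IPv6" ;;
        *) proto="other" ;;
    esac
    echo "dst=$dst src=$src ethertype=0x$etype ($proto)"
)
```

Calling `decode_ufw_mac 00:00:00:00:00:00:00:00:00:00:00:00:86:dd` should report two all-zero MACs and an IPv6 EtherType, which suggests blocked IPv6 loopback traffic rather than a hardware MAC problem.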

Also, I see some USB suspend, so perhaps USB just isn’t waking up.

Hi,

I do have firewall settings to block everything except ssh.

eth0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 00:04:4b:c7:02:8e  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 41

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1  (Local Loopback)
        RX packets 3152  bytes 194479 (194.4 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3152  bytes 194479 (194.4 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

rndis0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 1a:4a:81:2f:93:99  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

usb0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 1a:4a:81:2f:93:9b  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlan0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 00:04:4b:c7:02:8c  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

I don’t see the MAC address in ifconfig.

I can disable the firewall. The board is the standard Jetson TX2 devkit

Do you have serial console available? Info:
http://www.jetsonhacks.com/2015/12/01/serial-console-nvidia-jetson-tx1/

I am thinking that the issue might be USB auto-suspend. You might try this with the mouse and keyboard (this won’t stick around after boot, but if you want to make it permanent, then there are methods to do so on the kernel command line). Before you start, look at the following:

lsusb

Each USB device has a combination of vendor and product ID. For example, I have a cheap keyboard which shows up with “ID 04d9:1603”. Write this down for both mouse and keyboard. After this you may want to “sudo -s” to stay in a root shell, since most of what follows requires root.

Note that keyboards and mice are “HID” devices (they use the Human Interface Device generic drivers). cd to “/proc/bus/input/”. Look at the contents (there are many ways to do this). For example:

cd /proc/bus/input
less -i devices

Then search for “keyboard” or “mouse”. You will see a “Sysfs=” entry which identifies where the devices are in terms of “/sys” controllers. For example, I have this with a keyboard:

I: Bus=0003 Vendor=04d9 Product=1603 Version=0110
N: Name="  USB Keyboard"
P: Phys=usb-3530000.xhci-2.4/input1
S: Sysfs=/devices/3530000.xhci/usb1/1-2/1-2.4/1-2.4:1.1/0003:04D9:1603.0002/input/input4

Note that earlier I said my lsusb ID was “ID 04d9:1603”. The first part is a vendor, and matches, the second part after the ‘:’ is the product, and this matches. Now cd to the Sysfs part (substitute for what your mouse or keyboard is, but you have to prepend “/sys/”):

cd /sys/devices/3530000.xhci/usb1/1-2/1-2.4/1-2.4:1.1/0003:04D9:1603.0002/input/input4
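If you would rather not search the devices file by hand, the lookup can be scripted. This is only a sketch: the function name `input_sysfs_for_id` is made up, and it assumes the “I:” and “S:” lines are formatted like my keyboard example above.

```shell
# Print the /sys path(s) of input devices matching an lsusb-style ID.
# $1: ID as shown by lsusb, e.g. 04d9:1603
# $2: devices file (defaults to /proc/bus/input/devices)
input_sysfs_for_id() (
    id=$1
    file=${2:-/proc/bus/input/devices}
    vendor=${id%%:*}     # part before the ':'
    product=${id##*:}    # part after the ':'
    awk -v v="$vendor" -v p="$product" '
        # Remember whether the current device block matches the ID.
        /^I:/ { hit = (index($0, "Vendor=" v " ") > 0 && index($0, "Product=" p " ") > 0) }
        # Print the Sysfs entry of a matching block with /sys prepended.
        /^S: Sysfs=/ && hit { sub(/^S: Sysfs=/, ""); print "/sys" $0 }
    ' "$file"
)
```

A keyboard often registers more than one input node, so `input_sysfs_for_id 04d9:1603` may print several paths; check each one with “cat name”.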

Verify this is the correct device via “cat name”. Now cd to “power/”. Verify this is currently in “auto” mode:

cat control

Now, for this boot only, temporarily force this always on:

echo 'on' > control
# Verify now "on":
cat control

Do the same thing with both keyboard and mouse. Now let this keep running long enough to see if you still have the same issues. If this does the job, then two things can occur. First, NVIDIA might be able to help debug. Second, this is a workaround, and auto suspend could be disabled via other methods which persist across reboots.
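The check-and-set steps can be wrapped in a small helper so they are easy to repeat for each device. This is a sketch with a made-up function name (`force_power_on`); it only assumes the directory you pass contains a “power/control” file. I believe USB autosuspend itself is governed by the “power/control” of the USB device directory (the “1-2.4” level in my example path), so it may be worth applying it there as well, but treat that as an assumption to verify on your board.

```shell
# Force a device's runtime power management to "on" (no auto-suspend).
# $1: sysfs device directory that contains power/control.
# On the real /sys this must run as root; the setting is lost on reboot.
force_power_on() {
    ctl="$1/power/control"
    if [ ! -w "$ctl" ]; then
        echo "cannot write $ctl (wrong path, or not root?)" >&2
        return 1
    fi
    echo "before: $(cat "$ctl")"
    echo on > "$ctl"
    echo "after:  $(cat "$ctl")"
}
```

Run it from the root shell mentioned earlier. A plain “sudo echo on > control” would fail, because the redirection is performed by the non-root shell.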

If this does not help, then you will want to leave a serial console running and monitor the Xorg log:

sudo tail -f /var/log/Xorg.0.log

(there may be cases where it is “Xorg.1.log”; “ls -ltr Xorg*” lists the logs sorted by modification time, oldest first, so the last one listed is the active log after a boot)
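To avoid guessing which log is active, something like this can pick the most recently modified one. It is a sketch: `latest_xorg_log` is a made-up name, and it assumes the logs follow the “Xorg.N.log” naming in /var/log.

```shell
# Print the most recently modified Xorg.N.log in a directory.
# $1: log directory (defaults to /var/log)
latest_xorg_log() (
    dir=${1:-/var/log}
    # ls -t sorts newest first; keep only the first entry.
    # The pattern skips rotated copies such as Xorg.0.log.old.
    ls -t "$dir"/Xorg.*.log 2>/dev/null | head -n 1
)
```

It can then be combined with the command above, e.g. `sudo tail -f "$(latest_xorg_log)"`.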

FYI, X uses the USB subsystem (along with udev) to do the work of identifying input devices. My thought is that USB is probably not failing so much as something is interfering with X picking the devices back up after a suspend. The tail of this log will make notes about a suspend operation, and then again upon wake. Assuming you test without disabling auto suspend, I’m thinking you will see a suspend operation, but no wake operation, and probably no actual errors. You could do the same with “dmesg --follow”, but you can’t do both on the same serial console (well, you could, but it would not be worth it).

Side Note: You probably do not want to block traffic on “localhost/loopback/lo/127.0.0.*”. This is traffic which is not going to the outside world. Traffic which is purely on that interface should be whitelisted.

Thanks for the help linuxdev, I tried out these steps and it didn’t help. Looking at the Xorg log, I don’t see any issues with suspend (I searched /var/log/Xorg.0.log and didn’t find any suspend related to USB devices).

It’s not just USB that isn’t working, though. HDMI output isn’t either (my HDMI monitor just shows an Nvidia logo and never brings up a login screen), and despite switching off my firewall, I can’t ssh into my Jetson board (I get “connection refused”). I did notice that systemctl reports a degraded state because nvzramconfig.service failed to run. Could this be the problem?

$ sudo systemctl --failed
● nvzramconfig.service loaded failed failed ZRAM configuration

It looks like my board isn’t ok

Do you have a second module and/or carrier board you can swap as a test?

I am not familiar with nvzramconfig.service; someone else may have an idea on that, but a compressed RAM filesystem failing could in theory corrupt everything.

What test would you suggest?

Digging deeper, I’m noticing that the zram.ko file is missing from /lib/modules/4.9.140-tegra/kernel/drivers/block/zram/. I’m going to try copying in the file and seeing what happens.

I am actually looking at a version 32.1 release (you’re using 32.2), and do not see anything related to zram. The “uname -r” you have is correct, and the module location is what I would expect. Maybe someone else knows about the zramconfig module. However, is this kernel modified in any way? I wouldn’t expect the directory to be there unless the module were configured during the kernel build.

Sorry for late reply.

To debug this kind of issue, please:

  1. Try to dump the serial console log.
  2. If you cannot do (1), then please use ssh or HDMI to run “dmesg” after the error happens.

Hi natrajk,

Have you managed to identify the cause and resolve the problem?
Are there any results you can share?

Hi,

I was on vacation, so I didn’t get too far. I gave up and just re-flashed the board and it seems to be working for now. I can say that putting back the zram.ko file did not work and I wasn’t able to force the module to load with modprobe (I got an exec error).

If I run into anything I’ll post an update here (the problem seems to show up after a week)

What kind of exec error? If it was a format exec error, then the module was probably compiled for the wrong architecture.
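One way to check this is to compare the module’s “vermagic” string (from `modinfo -F vermagic`) with the running kernel release (from `uname -r`), and to look at what `file zram.ko` reports for the architecture. A minimal sketch of the comparison, where `vermagic_matches` is a made-up helper name:

```shell
# Return 0 if a module's vermagic string was built for the given kernel release.
# $1: vermagic string, e.g. from: modinfo -F vermagic /path/to/zram.ko
# $2: kernel release,  e.g. from: uname -r
vermagic_matches() {
    # The kernel release is the first space-separated word of vermagic.
    [ "${1%% *}" = "$2" ]
}
```

Usage would be along the lines of `vermagic_matches "$(modinfo -F vermagic zram.ko)" "$(uname -r)" && echo match || echo mismatch`. A mismatch, or a `file` result that is not 64-bit ARM on a TX2, would be consistent with a format exec error.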