Wired Network (eth0) keeps cutting out

Hi,

I stumbled on some fairly odd behaviour.

After about 2 days, the device started to randomly become unreachable over the network.

I installed JetPack 4.5.1 and use it mainly headless (no monitor attached).

I tried adding an /etc/network/interfaces file and stopping NetworkManager.service and wpa_supplicant.service, but that only seems to have made the cutting out and reconnecting more frequent. It didn’t solve the problem.

Most frustrating of all is that I can’t seem to find any logs regarding this. journalctl, dmesg and the logs in /var/log/ don’t seem to have any information from when the connection cuts out. ip addr show eth0 also never drops the address while the network is down (checked while logged in over the serial USB console).
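To at least have a timeline to correlate with the next cut-out, I plan to log the link state from the serial console with a small sketch like this (my own script; the function name, paths and interval are arbitrary):

```shell
#!/bin/sh
# Print one timestamped carrier-state line for an interface.
# The sysfs root is a parameter only so the function can be exercised
# against a fake directory tree; on the device the default is used.
snapshot() {
  iface=$1
  sysroot=${2:-/sys/class/net}
  # /sys/class/net/<iface>/carrier is 1 with link, 0 without;
  # reading it fails if the interface is missing or admin-down.
  carrier=$(cat "$sysroot/$iface/carrier" 2>/dev/null || echo '?')
  printf '%s %s carrier=%s\n' "$(date -Is)" "$iface" "$carrier"
}

# On the real device, run it in a loop over the serial console:
#   while true; do snapshot eth0 >> /var/log/eth0-watch.log; sleep 5; done
```

If the carrier flips to 0 at cut-out time it points at the PHY/driver; if it stays 1 while pings fail, the problem is higher up the stack.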

I’m truly sorry, but I don’t think I can provide any logs for this. I’m also running wireguard and docker-nvidia, but I highly doubt they could disrupt the entire network connection. I’m using the eth0 interface with DHCP (I didn’t change anything network-related), and the network cable in the exact same spot was used with another device for a long time without any issues.

I’m at my wits’ end here. Is there something to be aware of when running the device headless that could explain it sometimes just not being available anymore?

I would also be perfectly happy to ditch NetworkManager and wpa_supplicant. Is there more to it than stopping the services, adding a basic /etc/network/interfaces file and running ifdown -a && ifup -a?
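For context, the /etc/network/interfaces I have in mind is roughly this minimal DHCP setup (a sketch, assuming plain DHCP on eth0):

```
# /etc/network/interfaces
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet dhcp
```

I would also disable (not just stop) NetworkManager.service and wpa_supplicant.service, so they don’t come back after a reboot.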

After a reboot, I noticed that without NetworkManager, no IPv6 worked (even with iface eth0 inet6 dhcp set).
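One explanation I can think of (an assumption on my part, not verified): iface eth0 inet6 dhcp uses DHCPv6, but many networks only hand out prefixes via router advertisements (SLAAC). In that case the inet6 auto method would be the right one:

```
auto eth0
iface eth0 inet dhcp
# SLAAC via router advertisements instead of DHCPv6:
iface eth0 inet6 auto
```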

I added a USB Ethernet adapter, switched back to NetworkManager, and rebooted with the kernel no longer having the “quiet” cmdline param. After that I added a connection for the second card to NetworkManager so that both interfaces get an IP.

After a few minutes, pings to the IP on eth0 failed again. The USB card still worked, though. I’ll investigate over the next days whether using a USB adapter is the solution here. Maybe my devkit has some kind of manufacturing defect on the internal Ethernet port, or it’s just a weird driver issue.

I am not sure why you cannot share a log. A simple dmesg can at least let us know whether this comes from a low-level driver or a userspace tool.

BTW, is flashing the device an available option here? It feels like you’ve installed lots of things, and you probably don’t even remember everything you’ve configured.

Much of my setup came from a previous device. I know most of it, but the fact that I added everything at once certainly makes debugging harder.

Yesterday I experimented with Jellyfin and jetson-ffmpeg. I got it to do basic NVENC by passing through some more libraries and replacing ffmpeg with the community Jetson version. I noticed that the network often cut out when downing the docker-compose stack, but usually only when I had pid: host in it (privileged: true may also play a role).

So my current guess is that Docker somehow manages to kill the network when, e.g., destroying a network and removing bridges or iptables rules.
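To test this guess, I want to snapshot the firewall and bridge state before and after the suspected trigger and diff the two dumps (my own sketch, not a Docker feature; the function name and file paths are arbitrary, and it needs root on a real system):

```shell
#!/bin/sh
# Dump the current firewall rules and bridge devices to a file so that
# two runs, taken around `docker-compose down`, can be compared.
snapshot_net() {
  out=$1
  {
    iptables-save 2>/dev/null           # all rules, incl. the DOCKER chains
    ip link show type bridge 2>/dev/null  # docker0 and the br-* bridges
  } > "$out"
}

# Intended use around the suspected trigger:
#   snapshot_net /tmp/net-before.txt
#   docker-compose down
#   snapshot_net /tmp/net-after.txt
#   diff /tmp/net-before.txt /tmp/net-after.txt
```

If a rule or bridge that eth0 traffic depends on disappears in the diff, that would confirm Docker’s teardown as the culprit.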

In this case it also seemed that systemd services somehow got killed. I usually got SSH up again by restarting the service, which just makes me more confused about whether this is a network or a service issue (pings certainly cut out as well earlier).

For now I’ve installed docker-ce and will look if this improves things.

I also have Watchtower installed, which automatically updates containers, so that might have triggered some of the cut-outs on its own.


Here is part of my dmesg log. I don’t consider it very useful, since these entries never seemed to fall in a relatable timeframe with the cut-outs: they were sometimes hours old before a problem appeared (the problem itself didn’t cause any logs, and journalctl didn’t show anything related either).

[48697.025423] br-f57d8dd70ea1: port 1(vethc3d533f) entered blocking state
[48697.025432] br-f57d8dd70ea1: port 1(vethc3d533f) entered forwarding state
[48701.299238] br-f57d8dd70ea1: port 1(vethc3d533f) entered disabled state
[48701.299400] vethc836e25: renamed from eth0
[48701.362247] br-f57d8dd70ea1: port 1(vethc3d533f) entered disabled state
[48701.366242] device vethc3d533f left promiscuous mode
[48701.366252] br-f57d8dd70ea1: port 1(vethc3d533f) entered disabled state
[48761.237340] br-f57d8dd70ea1: port 1(vethd70d527) entered blocking state
[48761.237348] br-f57d8dd70ea1: port 1(vethd70d527) entered disabled state
[48761.237513] device vethd70d527 entered promiscuous mode
[48761.237658] IPv6: ADDRCONF(NETDEV_UP): vethd70d527: link is not ready
[48761.933843] eth0: renamed from vethcc29d99
[48761.957266] IPv6: ADDRCONF(NETDEV_CHANGE): vethd70d527: link becomes ready
[48761.957568] br-f57d8dd70ea1: port 1(vethd70d527) entered blocking state
[48761.957576] br-f57d8dd70ea1: port 1(vethd70d527) entered forwarding state
[48766.240318] br-f57d8dd70ea1: port 1(vethd70d527) entered disabled state
[48766.240919] vethcc29d99: renamed from eth0
[48766.304740] br-f57d8dd70ea1: port 1(vethd70d527) entered disabled state
[48766.308852] device vethd70d527 left promiscuous mode
[48766.308863] br-f57d8dd70ea1: port 1(vethd70d527) entered disabled state

EDIT: As for flashing: I would like to avoid it. I tested a lot with the toolchain before moving stuff over. I can certainly flash the bootloader/kernel; that shouldn’t affect my rootfs since it’s on an external SSD.

I think I narrowed the problem somewhat down.

Switching to docker-ce (PPA) didn’t change anything. But I now think the problem is that downing a privileged Jellyfin container somehow crashes systemd.

SSH:

(screenshot)

Serial:

I can’t seem to find any systemd logs about that (it crashes SSH and numerous other services). There is only one line over the serial connection, at which point I get logged out there as well.

Below is the docker-compose.yml I used. Most of it is commented out (I did successfully get Jellyfin to use NVENC with it). Using runtime: nvidia and pid: host doesn’t seem to be the problem, contrary to what I first thought.

docker-compose.yml
version: "2"
services:
  jellyfin:
#    image: linuxserver/jellyfin:arm32v7-latest
    image: linuxserver/jellyfin:arm64v8-latest
    labels: [ 'com.centurylinklabs.watchtower.enable=true' ]
    runtime: nvidia
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Europe/Berlin
      - NVIDIA_DRIVER_CAPABILITIES=all
      - NVIDIA_VISIBLE_DEVICES=all
    ports:
      - "127.0.0.1:84:8096"
    volumes:
      - ./config:/config
#      - ./ffmpeg_nvmpi:/ffmpeg_nvmpi
#      - /usr/local/lib/libnvmpi.so:/usr/local/lib/libnvmpi.so:ro
#      - /usr/local/lib/libnvmpi.so.1:/usr/local/lib/libnvmpi.so.1:ro
#      - /usr/local/lib/libnvmpi.so.1.0.0:/usr/local/lib/libnvmpi.so.1.0.0:ro
#      - /usr/lib/aarch64-linux-gnu/libasound.so:/usr/lib/aarch64-linux-gnu/libasound.so:ro
#      - /usr/lib/aarch64-linux-gnu/libasound.so.2:/usr/lib/aarch64-linux-gnu/libasound.so.2:ro
#      - /usr/lib/aarch64-linux-gnu/libasound.so.2.0.0:/usr/lib/aarch64-linux-gnu/libasound.so.2.0.0:ro
#      - /usr/lib/aarch64-linux-gnu/libxcb-shape.so:/usr/lib/aarch64-linux-gnu/libxcb-shape.so:ro
#      - /usr/lib/aarch64-linux-gnu/libxcb-shape.so.0:/usr/lib/aarch64-linux-gnu/libxcb-shape.so.0:ro
#      - /usr/lib/aarch64-linux-gnu/libxcb-shape.so.0.0.0:/usr/lib/aarch64-linux-gnu/libxcb-shape.so.0.0.0:ro
#      - /usr/lib/aarch64-linux-gnu/libxcb-xfixes.so:/usr/lib/aarch64-linux-gnu/libxcb-xfixes.so:ro
#      - /usr/lib/aarch64-linux-gnu/libxcb-xfixes.so.0:/usr/lib/aarch64-linux-gnu/libxcb-xfixes.so.0:ro
#      - /usr/lib/aarch64-linux-gnu/libxcb-xfixes.so.0.0.0:/usr/lib/aarch64-linux-gnu/libxcb-xfixes.so.0.0.0:ro
#      - /usr/lib/aarch64-linux-gnu/libEGL.so:/usr/lib/aarch64-linux-gnu/libEGL.so:ro
#      - /usr/lib/aarch64-linux-gnu/libEGL.so.1:/usr/lib/aarch64-linux-gnu/libEGL.so.1:ro
#      - /usr/lib/aarch64-linux-gnu/libEGL.so.1.0.0:/usr/lib/aarch64-linux-gnu/libEGL.so.1.0.0:ro
#      - /usr/lib/aarch64-linux-gnu/libGLdispatch.so:/usr/lib/aarch64-linux-gnu/libGLdispatch.so:ro
#      - /usr/lib/aarch64-linux-gnu/libGLdispatch.so.0:/usr/lib/aarch64-linux-gnu/libGLdispatch.so.0:ro
#      - /usr/lib/aarch64-linux-gnu/libGLdispatch.so.0.0.0:/usr/lib/aarch64-linux-gnu/libGLdispatch.so.0.0.0:ro
#      - /etc/ld.so.conf.d/nvidia-tegra.conf:/etc/ld.so.conf.d/nvidia-tegra.conf:ro
      - /media/jellyfin-test:/data
#    privileged: true
#    pid: host 
#    devices:
      # VAAPI Devices
      #- /dev/dri:/dev/dri
      # RPi 4
      #- /dev/vchiq:/dev/vchiq

# TODO FOR HW-ENC:
# Other resolution might not work
# - chmod +s ffmpeg_nvmpi/ffmpeg
# - apt update && apt-get install libgstreamer1.0-0 gstreamer1.0-tools gstreamer1.0-plugins-good gstreamer1.0-plugins-bad gstreamer1.0-plugins-ugly gstreamer1.0-libav libgstrtspserver-1.0-0 libjansson4

I currently can’t reproduce the weird ping cut outs, but guess that it was partially caused by systemd crashing.

My guess would be that my Watchtower container updated the Jellyfin container at some point (and thus downed it), which caused the failure; the failing pings were the aftermath of the system dying halfway.

I really hope this is the case. Maybe it’s some weird interaction between Docker and the Tegra kernel.

It seems to have been that container indeed when it had privileged rights.

I’m still not sure why, but as said above, I guess the network issues were basically the result of systemd crashing. The system was probably unstable at that point.

It has now run fine for 2 days with the default Ethernet connector of the devkit.