Device suddenly not reachable via Ethernet during stress testing on TX2/TX2-NX

We are using TX2 and TX2-NX based custom boards.
With L4T-R32.7.3 firmware, during repeated-reboot stress testing, we observed that the device becomes unreachable over the network because eth0 is down. This issue is reproducible on both TX2 and TX2-NX devices.

We can easily reproduce the issue using the following command, which binds/unbinds the eqos driver in a loop.

x=0; while `echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/unbind; sleep 0.2; dmesg -C; echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/bind`; do sleep 4; dmesg; echo REBOOT_$((x=x+1)) $(date +%T); done &
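
For readability, the same one-liner can be written out as a script (a sketch; the device path 2490000.ether_qos and the sleep intervals are taken from the command above):

```shell
#!/bin/sh
# Bind/unbind stress loop for the eqos driver, equivalent to the
# one-liner above (device path from the original command).
DEV=2490000.ether_qos
DRV=/sys/bus/platform/drivers/eqos

stress_loop() {
    i=0
    while :; do
        echo "$DEV" > "$DRV/unbind"
        sleep 0.2
        dmesg -C                     # clear the kernel ring buffer
        echo "$DEV" > "$DRV/bind"
        sleep 4
        dmesg                        # show only this iteration's messages
        i=$((i + 1))
        echo "REBOOT_$i $(date +%T)"
    done
}

# On the device, start it in the background:
#   stress_loop &
```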

In the faulty state, the eth0 link does not come back up; please see the logs below. Reaching the faulty state took anywhere from ~200 to ~2000 loop iterations, so it appears to happen at random.

   root@ovp81x-68-26-d3 (productive):~# dmesg | grep eth0
   [13030.833213] eqos 2490000.ether_qos eth0: Link is Up - 100Mbps/Full - flow control off
   [13030.833956] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
   [13034.158468] net eth0: get_configure_l3v4_filter -->
   [13034.159727] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
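
Since the failure takes hundreds to thousands of iterations, it helps to detect it automatically inside the loop. A sketch, polling the kernel's view of the link via the standard sysfs `carrier` attribute (the 10-second timeout is an assumption):

```shell
# Return 0 once the interface reports carrier within a timeout, else 1.
# $1 = path to the interface's sysfs "carrier" file, $2 = timeout in seconds.
wait_for_carrier() {
    path=$1
    timeout=$2
    t=0
    while [ "$t" -lt "$timeout" ]; do
        # "carrier" reads 1 when the kernel considers the link up;
        # reading it errors out while the interface is down, hence 2>/dev/null.
        if [ "$(cat "$path" 2>/dev/null)" = "1" ]; then
            return 0
        fi
        sleep 1
        t=$((t + 1))
    done
    return 1
}

# On the device, stop the stress loop as soon as eth0 fails to recover:
#   wait_for_carrier /sys/class/net/eth0/carrier 10 || echo "faulty state reached"
```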

After that, we tried the L4T-R32.7.4 release firmware and observed different behavior with the same tests on the TX2-NX custom board.

  1. The systemd-networkd service halts and stops after roughly ~8000 bind/unbind iterations of the ethernet eqos driver. The VPU becomes unresponsive after that, and the device must be rebooted to recover. Please see the attached logs (systemd-networkd-crash-logs-l4t-3274.txt).
  2. In the second scenario, the VPU reboots overnight after a kernel crash with core dumps. This happens after ~6000 to ~7000 bind/unbind iterations of the ethernet eqos driver. Please see the attached logs (ether-qos-crash-logs-l4t-3274.txt).

Note that the same issue has not yet been reproduced on the TX2-based custom device with the L4T-R32.7.4 release firmware, even after ~11500 iterations.

Could you please advise whether this is a known issue, and whether a solution is available?

Thank You,
Pratik Manvar

systemd-networkd-crash-logs-l4t-3274.txt.txt (14.4 KB)
ether-qos-crash-logs-l4t-3274.txt.txt (55.2 KB)

Hello,

Thanks for visiting the NVIDIA forums! Your topic will be best served in the Jetson category.

I will move this post over for visibility.

Cheers,
Tom


I did my own testing on the same custom board (TX2-NX) and, for good measure, logged the Linux eqos driver output (built with -DDEBUG; attachments: eqos_debug_*).
eqos_debug_failure.txt (112.0 KB)
eqos_debug_working.txt (64.5 KB)

The main differences between the working and failing cases are:

  • process_rx_completions - never called in eqos_debug_failure
  • rx_descriptor_reset - never called in eqos_debug_failure
  • process_tx_completions - the dirty_tx counter grows without bound in eqos_debug_failure, as the driver keeps trying to send data that does not appear to pass through the PHY

I later rebuilt the driver with extra IRQ logging and can confirm that no RX interrupts occur on the eqos driver side (e.g. eqos_enable_chan_rx_interrupt and eqos_disable_chan_rx_interrupt are never called).

I can reproduce the issue more quickly when auto-negotiation is turned off on both sides, i.e. both peers (directly connected via eth0) run: ethtool -s eth0 speed 100 duplex full autoneg off or ethtool -s eth0 speed 10 duplex half autoneg off.

Extra MDIO debug logs for “autonegotiation off” case are also attached (eqos_mdio_*).
eqos_mdio_failure.txt (9.1 KB)
eqos_mdio_working.txt (74.8 KB)

When the failure happens, ethtool eth0 reports “Link detected: no” on our device but “Link detected: yes” on the testing PC.

Speed/duplex does not seem to impact the reproducibility.

I also noticed that while ethtool eth0 reports “Link detected: no”, the green LED on the ETH0 port of our device stays lit.

The LED turns off correctly when I unplug the cable or run ifconfig eth0 down, implying that the PHY believes the link is up while the Linux side does not.
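
This PHY-vs-kernel disagreement can also be checked programmatically. A sketch comparing the kernel's carrier bit with the PHY link bit as reported by mii-tool (this assumes the driver supports the MII ioctls that mii-tool uses; mii-tool is part of net-tools):

```shell
# Flag the mismatch seen above: PHY says link up, kernel says link down.
# $1 = kernel carrier value (0/1), $2 = PHY link value (0/1).
link_state_mismatch() {
    [ "$1" = "0" ] && [ "$2" = "1" ]
}

# On the device (assumed invocation; mii-tool output wording may vary
# slightly between net-tools versions):
#   carrier=$(cat /sys/class/net/eth0/carrier 2>/dev/null || echo 0)
#   mii-tool eth0 2>/dev/null | grep -q "link ok" && phy=1 || phy=0
#   link_state_mismatch "$carrier" "$phy" && echo "PHY up, kernel link down"
```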

The failure is fixed by restarting the interface on the DUT, e.g. ifconfig eth0 down; ifconfig eth0 up.
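
Until the root cause is found, the down/up workaround could be automated with a small watchdog. A sketch, under the assumption that a cable is always plugged in (so a missing carrier means the faulty state rather than an unplugged cable); the 15-second grace period is an arbitrary choice:

```shell
#!/bin/sh
# Watchdog sketch: bounce eth0 when the kernel reports no carrier for
# GRACE consecutive seconds (assumes the cable is always plugged in).
IFACE=eth0
GRACE=15

# Decision helper: $1 = current carrier value, $2 = seconds without carrier.
should_bounce() {
    [ "$1" != "1" ] && [ "$2" -ge "$GRACE" ]
}

# Pass --run on the device to start the loop; without it, only the
# helper above is defined.
if [ "${1:-}" = "--run" ]; then
    down_for=0
    while :; do
        carrier=$(cat "/sys/class/net/$IFACE/carrier" 2>/dev/null)
        if [ "$carrier" = "1" ]; then
            down_for=0
        elif should_bounce "$carrier" "$down_for"; then
            ip link set "$IFACE" down
            sleep 1
            ip link set "$IFACE" up
            down_for=0
        else
            down_for=$((down_for + 1))
        fi
        sleep 1
    done
fi
```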

Best regards,
Wojtek


Hello,
Has anyone come across a similar issue? Any information or suggestions on this topic would be very helpful. Thank you.

Can this issue be reproduced on an NV devkit?

Hi @WayneWWW,
Thanks for your quick response.
Yes, we were able to reproduce the same issue on the setup where we have TX2-NX SOM module in a Xavier-NX dev kit.

Please share the steps to reproduce this issue.

Hello @WayneWWW,

We can reproduce the issue using the following script, which binds/unbinds the eqos driver in a loop.

x=0; while `echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/unbind; sleep 0.2; dmesg -C; echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/bind`; do sleep 4; dmesg; echo REBOOT_$((x=x+1)) $(date +%T); done &

The issue can be reproduced quickly when auto-negotiation is turned off on both sides, i.e. both peers (directly connected via eth0) run:
ethtool -s eth0 speed 100 duplex full autoneg off or ethtool -s eth0 speed 10 duplex half autoneg off.
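
To confirm that the forced settings actually took effect on each peer, the ethtool output can be checked. A sketch (it assumes ethtool's usual textual output format, which is stable in practice but not a formal API):

```shell
# Return 0 if the given "ethtool <iface>" output shows autoneg disabled.
# $1 = full output of "ethtool <iface>".
autoneg_off() {
    printf '%s\n' "$1" | grep -q 'Auto-negotiation: off'
}

# On each peer:
#   autoneg_off "$(ethtool eth0)" && echo "autoneg disabled"
```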

Thank you!

Is there a real-usecase method to reproduce this?

I mean, bind/unbind does not seem like something that would be performed consecutively in a real scenario…

Yes, right. Bind/unbind is not the conventional method.
Actually, this issue was first reproduced during a repeated-reboot stress test, where the device suddenly became unreachable via Ethernet.

So, to reproduce this issue, we tried the following tests repeatedly.

  • software reboot
  • “ip link set eth0 down” / “ip link set eth0 up”
  • unbind/bind eqos driver

The fastest ways to reproduce would probably be

  • interface up/down:
x=0; while `ip link set eth0 down; sleep 0.2; ip link set eth0 up`; do sleep 1; echo REBOOT_$((x=x+1)) $(date +%T); done

OR

  • unbind/bind eqos driver:
x=0; while `echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/unbind; sleep 0.2; echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/bind`; do sleep 1; echo REBOOT_$((x=x+1)) $(date +%T); done

Thanks!