Device suddenly not reachable via Ethernet during stress testing on TX2/TX2-NX

We are using TX2 and TX2-NX based custom boards.
With L4T-R32.7.3 firmware, during repeated-reboot stress testing, we observed that the device becomes unreachable over the network because eth0 is down. This issue is reproducible on both TX2 and TX2-NX devices.

We can easily reproduce the issue using the following command, which binds/unbinds the eqos driver in a loop.

x=0; while `echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/unbind; sleep 0.2; dmesg -C; echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/bind`; do sleep 4; dmesg; echo REBOOT_$((x=x+1)) $(date +%T); done &
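
For readability, the same one-liner can be written out as a script (a sketch; the device path 2490000.ether_qos and the sleep intervals are taken from the command above):

```shell
#!/bin/sh
# Bind/unbind stress loop for the eqos driver, equivalent to the
# one-liner above (device path from the original command).
DEV=2490000.ether_qos
DRV=/sys/bus/platform/drivers/eqos

stress_loop() {
    i=0
    while :; do
        echo "$DEV" > "$DRV/unbind"
        sleep 0.2
        dmesg -C                     # clear the kernel ring buffer
        echo "$DEV" > "$DRV/bind"
        sleep 4
        dmesg                        # show only this iteration's messages
        i=$((i + 1))
        echo "REBOOT_$i $(date +%T)"
    done
}

# On the device, start it in the background:
#   stress_loop &
```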

In the faulty state, the eth0 link does not come back up; please see the logs below. Reaching the faulty state took anywhere from ~200 to ~2000 loop iterations, so it appears to happen at random.

   root@ovp81x-68-26-d3 (productive):~# dmesg | grep eth0
   [13030.833213] eqos 2490000.ether_qos eth0: Link is Up - 100Mbps/Full - flow control off
   [13030.833956] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
   [13034.158468] net eth0: get_configure_l3v4_filter -->
   [13034.159727] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
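
Since the failure takes hundreds to thousands of iterations, it helps to detect it automatically inside the loop. A sketch, polling the kernel's view of the link via the standard sysfs `carrier` attribute (the 10-second timeout is an assumption):

```shell
# Return 0 once the interface reports carrier within a timeout, else 1.
# $1 = path to the interface's sysfs "carrier" file, $2 = timeout in seconds.
wait_for_carrier() {
    path=$1
    timeout=$2
    t=0
    while [ "$t" -lt "$timeout" ]; do
        # "carrier" reads 1 when the kernel considers the link up;
        # reading it errors out while the interface is down, hence 2>/dev/null.
        if [ "$(cat "$path" 2>/dev/null)" = "1" ]; then
            return 0
        fi
        sleep 1
        t=$((t + 1))
    done
    return 1
}

# On the device, stop the stress loop as soon as eth0 fails to recover:
#   wait_for_carrier /sys/class/net/eth0/carrier 10 || echo "faulty state reached"
```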

After that, we tried the L4T-R32.7.4 release firmware and observed different behavior with the same tests on the TX2-NX custom board.

  1. The systemd-networkd service halts and stops after roughly ~8000 bind/unbind iterations of the ethernet eqos driver. The VPU becomes unresponsive after that, and the device must be rebooted to recover. Please see the attached logs (systemd-networkd-crash-logs-l4t-3274.txt).
  2. In the second scenario, the VPU reboots overnight after a kernel crash with core dumps. This happens after ~6000 to ~7000 bind/unbind iterations of the ethernet eqos driver. Please see the attached logs (ether-qos-crash-logs-l4t-3274.txt).

Note that the same issue has not yet been reproduced on the TX2-based custom device with the L4T-R32.7.4 release firmware, even after ~11500 iterations.

Could you please advise whether this is a known issue, and whether a solution is available?

Thank You,
Pratik Manvar

systemd-networkd-crash-logs-l4t-3274.txt.txt (14.4 KB)
ether-qos-crash-logs-l4t-3274.txt.txt (55.2 KB)

Hello,

Thanks for visiting the NVIDIA forums! Your topic will be best served in the Jetson category.

I will move this post over for visibility.

Cheers,
Tom


I did my own testing on the same custom board (TX2-NX) and, for good measure, logged the Linux eqos driver output (built with -DDEBUG; attachments: eqos_debug_*).
eqos_debug_failure.txt (112.0 KB)
eqos_debug_working.txt (64.5 KB)

The main differences between the working and failing cases are:

  • process_rx_completions - never called in eqos_debug_failure
  • rx_descriptor_reset - never called in eqos_debug_failure
  • process_tx_completions - the dirty_tx counter grows without bound in eqos_debug_failure, as the driver keeps trying to send data that does not appear to pass through the PHY

I later rebuilt the driver with extra IRQ logging and can confirm that no RX interrupts occur on the eqos driver side (e.g. eqos_enable_chan_rx_interrupt and eqos_disable_chan_rx_interrupt are never called).

I can reproduce the issue more quickly when auto-negotiation is turned off on both sides, i.e. both peers (directly connected via eth0) run: ethtool -s eth0 speed 100 duplex full autoneg off or ethtool -s eth0 speed 10 duplex half autoneg off.

Extra MDIO debug logs for “autonegotiation off” case are also attached (eqos_mdio_*).
eqos_mdio_failure.txt (9.1 KB)
eqos_mdio_working.txt (74.8 KB)

When the failure happens, ethtool eth0 reports “Link detected: no” on our device but “Link detected: yes” on the testing PC.

Speed/duplex does not seem to impact the reproducibility.

I also noticed that while ethtool eth0 reports “Link detected: no”, the green LED on the ETH0 port of our device stays lit.

The LED turns off correctly when I unplug the cable or run ifconfig eth0 down, implying that the PHY believes the link is up while the Linux side does not.
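
This PHY-vs-kernel disagreement can also be checked programmatically. A sketch comparing the kernel's carrier bit with the PHY link bit as reported by mii-tool (this assumes the driver supports the MII ioctls that mii-tool uses; mii-tool is part of net-tools):

```shell
# Flag the mismatch seen above: PHY says link up, kernel says link down.
# $1 = kernel carrier value (0/1), $2 = PHY link value (0/1).
link_state_mismatch() {
    [ "$1" = "0" ] && [ "$2" = "1" ]
}

# On the device (assumed invocation; mii-tool output wording may vary
# slightly between net-tools versions):
#   carrier=$(cat /sys/class/net/eth0/carrier 2>/dev/null || echo 0)
#   mii-tool eth0 2>/dev/null | grep -q "link ok" && phy=1 || phy=0
#   link_state_mismatch "$carrier" "$phy" && echo "PHY up, kernel link down"
```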

The failure is fixed by restarting the interface on the DUT, e.g. ifconfig eth0 down; ifconfig eth0 up.
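
Until the root cause is found, the down/up workaround could be automated with a small watchdog. A sketch, under the assumption that a cable is always plugged in (so a missing carrier means the faulty state rather than an unplugged cable); the 15-second grace period is an arbitrary choice:

```shell
#!/bin/sh
# Watchdog sketch: bounce eth0 when the kernel reports no carrier for
# GRACE consecutive seconds (assumes the cable is always plugged in).
IFACE=eth0
GRACE=15

# Decision helper: $1 = current carrier value, $2 = seconds without carrier.
should_bounce() {
    [ "$1" != "1" ] && [ "$2" -ge "$GRACE" ]
}

# Pass --run on the device to start the loop; without it, only the
# helper above is defined.
if [ "${1:-}" = "--run" ]; then
    down_for=0
    while :; do
        carrier=$(cat "/sys/class/net/$IFACE/carrier" 2>/dev/null)
        if [ "$carrier" = "1" ]; then
            down_for=0
        elif should_bounce "$carrier" "$down_for"; then
            ip link set "$IFACE" down
            sleep 1
            ip link set "$IFACE" up
            down_for=0
        else
            down_for=$((down_for + 1))
        fi
        sleep 1
    done
fi
```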

Best regards,
Wojtek


Hello,
Has anyone come across a similar issue? Any information or suggestions on this topic would be very helpful. Thank you.

Can this issue be reproduced on an NV devkit?

Hi @WayneWWW,
Thanks for your quick response.
Yes, we were able to reproduce the same issue on the setup where we have TX2-NX SOM module in a Xavier-NX dev kit.

Please share the steps to reproduce this issue.

Hello @WayneWWW,

We can reproduce the issue using the following script, which binds/unbinds the eqos driver in a loop.

x=0; while `echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/unbind; sleep 0.2; dmesg -C; echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/bind`; do sleep 4; dmesg; echo REBOOT_$((x=x+1)) $(date +%T); done &

The issue can be reproduced quickly when auto-negotiation is turned off on both sides, i.e. both peers (directly connected via eth0) run:
ethtool -s eth0 speed 100 duplex full autoneg off or ethtool -s eth0 speed 10 duplex half autoneg off.
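
To confirm that the forced settings actually took effect on each peer, the ethtool output can be checked. A sketch (it assumes ethtool's usual textual output format, which is stable in practice but not a formal API):

```shell
# Return 0 if the given "ethtool <iface>" output shows autoneg disabled.
# $1 = full output of "ethtool <iface>".
autoneg_off() {
    printf '%s\n' "$1" | grep -q 'Auto-negotiation: off'
}

# On each peer:
#   autoneg_off "$(ethtool eth0)" && echo "autoneg disabled"
```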

Thank you!

Is there a real-usecase method to reproduce this?

I mean, bind/unbind does not seem like something that would be performed consecutively in a real scenario…

Yes, right. Bind/unbind is not the conventional method.
Actually, this issue was first reproduced during a repeated-reboot stress test, where the device suddenly became unreachable via Ethernet.

So, to reproduce this issue, we tried the following tests repeatedly.

  • software reboot
  • “ip link set eth0 down” / “ip link set eth0 up”
  • unbind/bind eqos driver

The fastest ways to reproduce would probably be

  • interface up/down:
x=0; while `ip link set eth0 down; sleep 0.2; ip link set eth0 up`; do sleep 1; echo REBOOT_$((x=x+1)) $(date +%T); done

OR

  • unbind/bind eqos driver:
x=0; while `echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/unbind; sleep 0.2; echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/bind`; do sleep 1; echo REBOOT_$((x=x+1)) $(date +%T); done

Thanks!