We are using TX2 and TX2-NX based custom boards.
With the L4T-R32.7.3 firmware, during a reboot stress test, we observed that the device becomes unreachable over the network because eth0 is down. This issue is reproducible on both TX2 and TX2-NX devices.
We can easily reproduce the issue with a loop that binds/unbinds the eqos driver, as sketched below.
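For reference, a minimal sketch of that loop (the full one-liners we used are also listed later in this thread; the LOOP_ label is just an arbitrary counter for the console output):
x=0
while echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/unbind; sleep 0.2; echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/bind
do
    # pause between iterations and print an iteration counter with a timestamp
    sleep 1
    echo LOOP_$((x=x+1)) $(date +%T)
done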
In the faulty state the eth0 link does not come up; please see the logs below. It took anywhere from ~200 to ~2000 loop iterations to reach the faulty state, so it seems to happen at random.
root@ovp81x-68-26-d3 (productive):~# dmesg | grep eth0
[13030.833213] eqos 2490000.ether_qos eth0: Link is Up - 100Mbps/Full - flow control off
[13030.833956] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[13034.158468] net eth0: get_configure_l3v4_filter -->
[13034.159727] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
After that, we tried the L4T-R32.7.4 release firmware and observed different behavior with the same tests on the TX2-NX custom board.
The systemd-networkd service gets halted and stopped after ~8000 iterations of binding/unbinding the Ethernet eqos driver. The VPU becomes unresponsive after that, and the device must be rebooted to recover. Please see the attached logs (systemd-networkd-crash-logs-l4t-3274.txt).
In the second scenario, the VPU reboots overnight after the kernel crashes with core dumps. This happens after ~6000 to ~7000 iterations of binding/unbinding the Ethernet eqos driver. Please see the attached logs (ether-qos-crash-logs-l4t-3274.txt).
Note that this same issue has not yet been reproduced on the TX2-based custom device with the L4T-R32.7.4 release firmware, even after ~11500 iterations.
Can you please advise whether this is a known issue, and whether a solution is available for it?
I did my own testing on the same custom board (TX2-NX) and logged the Linux eqos driver output for good measure (built with -DDEBUG, attachments: eqos_debug_*): eqos_debug_failure.txt (112.0 KB), eqos_debug_working.txt (64.5 KB)
The main differences between the working and failing cases are:
process_rx_completions - is never called in eqos_debug_failure
rx_descriptor_reset - is never called in eqos_debug_failure
process_tx_completions - the dirty_tx counter grows without limit in eqos_debug_failure, as the driver keeps attempting to send data that does not seem to pass through the PHY
I later rebuilt the driver with extra IRQ logging and can confirm that no RX interrupts occur on the eqos driver side (e.g. eqos_enable_chan_rx_interrupt and eqos_disable_chan_rx_interrupt are never called).
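As a side note, the same thing can be observed from userspace by watching the interrupt counters; the ether_qos name below is an assumption based on the 2490000.ether_qos device name on our board:
# watch the eqos interrupt counters; if the RX channel count never increases, no RX IRQs are being delivered
watch -n 1 'grep ether_qos /proc/interrupts'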
I can reproduce the issue more quickly when auto-negotiation is turned off on both sides, e.g. both peers (directly connected via eth0) run: ethtool -s eth0 speed 100 duplex full autoneg off or ethtool -s eth0 speed 10 duplex half autoneg off.
Hi @WayneWWW,
Thanks for your quick response.
Yes, we were able to reproduce the same issue on a setup with a TX2-NX SOM module in a Xavier-NX dev kit.
The issue can be reproduced more quickly when auto-negotiation is turned off on both sides, e.g. both peers (directly connected via eth0) run: ethtool -s eth0 speed 100 duplex full autoneg off or ethtool -s eth0 speed 10 duplex half autoneg off.
Yes, that's right; bind/unbind is not the conventional method.
Actually, this issue was first seen during the repeated reboot (stress) test, where the device suddenly became unreachable via Ethernet.
So, to reproduce this issue, we repeatedly ran the following tests:
software reboot
“ip link set eth0 down” / “ip link set eth0 up”
unbind/bind eqos driver
The fastest way to reproduce would probably be either of the following.
interface up/down:
x=0; while ip link set eth0 down; sleep 0.2; ip link set eth0 up; do sleep 1; echo REBOOT_$((x=x+1)) $(date +%T); done
OR
unbind/bind eqos driver:
x=0; while echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/unbind; sleep 0.2; echo 2490000.ether_qos > /sys/bus/platform/drivers/eqos/bind; do sleep 1; echo REBOOT_$((x=x+1)) $(date +%T); done
Hello,
Just an update: we recently migrated to the latest L4T release, R32.7.5.
There seems to be some improvement with this release, but the issue (sometimes the eth0 link does not come up during boot) is still reproducible with the continuous reboot test.
So, as a workaround, we added the script below and call it from a systemd service (an example unit is sketched after the script) to reset the eth0 interface if it is not already up during boot.
#!/bin/sh
# Reset eth0 if the link is down and no packets have been sent or received since boot.
ETH0_STATE=$(cat /sys/class/net/eth0/operstate)
ETH0_TX_PACKETS=$(cat /sys/class/net/eth0/statistics/tx_packets)
ETH0_RX_PACKETS=$(cat /sys/class/net/eth0/statistics/rx_packets)

if [ "$ETH0_STATE" = "down" ] && [ "$ETH0_TX_PACKETS" -eq 0 ] && [ "$ETH0_RX_PACKETS" -eq 0 ]; then
    echo "Resetting eth0 interface.."
    # Save the eth0-related kernel messages for later analysis before toggling the link.
    dmesg | grep eth0 >> /run/eth0-interface-reset.txt
    echo "Resetting eth0 interface.." >> /run/eth0-interface-reset.txt
    ip link set eth0 down
    ip link set eth0 up
    echo "..Done" >> /run/eth0-interface-reset.txt
    echo "..Done"
fi
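For completeness, a minimal sketch of the systemd unit that runs the script once at boot; the unit name and the script path /usr/local/bin/eth0-link-check.sh are just examples, adjust them to your setup:
[Unit]
Description=Reset eth0 if the link did not come up during boot
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/eth0-link-check.sh

[Install]
WantedBy=multi-user.target
Enable it once with "systemctl enable eth0-link-check.service" so it runs on every boot.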
After that, it requests and enables interrupts for the PHY device with a valid IRQ number (353 in our case), and the PHY device is then configured successfully for the requested interrupt.
The above flow works as expected, with no errors reported. But it seems the interrupt is not generated by the PHY device, and ultimately the adjust_link handler is never called.
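For reference, a quick way to check whether that interrupt ever fires is to watch its counter in /proc/interrupts (353 is the IRQ number reported on our board):
# if the count for IRQ 353 does not change across link up/down events, the PHY is not raising the interrupt
grep '^ *353:' /proc/interrupts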
Can you please help us understand this behavior?