eth0 timeout on tx1 L4T R24.2.1 when bridged

Hi!

eth0 times out on my tx1 flashed with the latest L4T R24.2.1 image if it is bridged.

When performing a large file copy (~100 MB) from the tx1 to a remote PC via samba, 5-10 MB transfer correctly and then the transfer hangs.

My network interfaces are configured this way:

# interfaces(5) file used by ifup(8) and ifdown(8)
# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

auto lo br0
iface lo inet loopback

allow-hotplug eth0
iface eth0 inet manual

iface br0 inet static
	bridge_ports eth0
	address 192.168.0.8
	network 192.168.0.0
	netmask 255.255.255.0
	gateway 192.168.0.100
	dns-nameservers 8.8.8.8 8.8.4.4

eth0 is configured in full gigabit.
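Since the problem only seems to show up at gigabit speed, it may help to record the negotiated link mode along with each test. The sketch below assumes ethtool is installed on the tx1 and that its report contains the usual "Speed:" and "Duplex:" lines; link_mode is a made-up helper name, not part of any tool.

```shell
#!/bin/sh
# link_mode: hypothetical helper, reads "ethtool <iface>" output on stdin
# and prints the negotiated speed and duplex on one line.
link_mode() {
    awk -F': ' '
        /^[[:space:]]*Speed:/  { speed = $2 }
        /^[[:space:]]*Duplex:/ { duplex = $2 }
        END { print speed, duplex }
    '
}

# Typical use on the tx1 (assumes ethtool is available):
#   ethtool eth0 | link_mode
```

The function only parses text, so it can also be run against a saved copy of the ethtool output.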

dmesg shows:

[  196.011588] ------------[ cut here ]------------
[  196.011604] WARNING: at /dvs/git/dirty/git-master_linux/kernel/net/sched/sch_generic.c:255 dev_watchdog+0x188/0x2b0()
[  196.011609] NETDEV WATCHDOG: eth0 (r8152): transmit queue 0 timed out
[  196.011613] Modules linked in: bnep bcmdhd cfg80211 bluedroid_pm
[  196.011628] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.96-tegra #1
[  196.011632] Call trace:
[  196.011639] [<ffffffc000089df4>] dump_backtrace+0x0/0xf4
[  196.011644] [<ffffffc00008a0f4>] show_stack+0x14/0x1c
[  196.011649] [<ffffffc00032a59c>] dump_stack+0x20/0x28
[  196.011655] [<ffffffc0000a72e4>] warn_slowpath_common+0x78/0x9c
[  196.011660] [<ffffffc0000a7358>] warn_slowpath_fmt+0x50/0x58
[  196.011665] [<ffffffc0009afee4>] dev_watchdog+0x188/0x2b0
[  196.011670] [<ffffffc0000b9578>] call_timer_fn+0xa8/0x194
[  196.011675] [<ffffffc0000b9c78>] run_timer_softirq+0x22c/0x280
[  196.011680] [<ffffffc0000b0d74>] __do_softirq+0x180/0x2ec
[  196.011684] [<ffffffc0000b0fc0>] do_softirq+0x48/0x6c
[  196.011689] [<ffffffc0000b12a8>] irq_exit+0x84/0xc8
[  196.011693] [<ffffffc0000859e0>] handle_IRQ+0x98/0xc8
[  196.011697] [<ffffffc0000813d4>] gic_handle_irq+0x74/0x194
[  196.011701] Exception stack(0xffffffc0011d3d90 to 0xffffffc0011d3eb0)
[  196.011706] 3d80:                                     00000002 00000000 1fed2400 ffffffc0
[  196.011712] 3da0: 011d3ef0 ffffffc0 007b0970 ffffffc0 80000145 00000000 00000001 00000000
[  196.011717] 3dc0: 8007b000 00000000 8007d000 00000000 00000000 00000000 000222b5 00000000
[  196.011721] 3de0: 00000000 00000000 00000000 00000000 003c6538 00000000 13c47f94 00251905
[  196.011726] 3e00: 3414f94e 00000000 ffffffff 00ffffff ee500dbf 00000000 00000000 00000000
[  196.011731] 3e20: 1ed1a000 00000000 ffffd765 00000000 000000ae 00000000 000000c0 00000000
[  196.011736] 3e40: 000000ff 00000000 000000ff 00000000 000000ff 00000000 870f57b8 0000007f
[  196.011740] 3e60: 00000014 00000000 00000002 00000000 1fed2400 ffffffc0 00000000 00000000
[  196.011745] 3e80: 1fed2718 ffffffc0 00000000 00000000 00000000 00000000 8007b000 00000000
[  196.011748] 3ea0: 8007d000 00000000 000803f8 ffffffc0
[  196.011753] [<ffffffc000084e04>] el1_irq+0x84/0xf0
[  196.011757] [<ffffffc000086724>] arch_cpu_idle+0xc/0x24
[  196.011763] [<ffffffc0000f9a08>] cpu_idle_loop+0x21c/0x284
[  196.011767] [<ffffffc0000f9a90>] freezing_slow_path+0x0/0x84
[  196.011773] [<ffffffc000b3868c>] rest_init+0x88/0x94
[  196.011777] [<ffffffc00114f8a8>] start_kernel+0x298/0x2a4
[  196.011781] ---[ end trace 6397de4db638950c ]---

However, if I do not bridge eth0 to br0, using the network configuration file below, the transfer works flawlessly!

# interfaces(5) file used by ifup(8) and ifdown(8)
# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

auto lo br0
iface lo inet loopback

allow-hotplug eth0
iface eth0 inet static
	address 192.168.0.8
	network 192.168.0.0
	netmask 255.255.255.0
	gateway 192.168.0.100
	dns-nameservers 8.8.8.8 8.8.4.4

Would you have any ideas on how to solve the issue?

Thank you!

That looks like a legitimate bug in the r8152 driver, but it could also be coincidence that the driver was what was running when a scheduler bug hit. The r8152 driver itself has been around a long time, but testing in combination with bridge mode and samba has probably been very limited (the problem is more likely related to scheduler interaction than to the r8152 itself).

I have not set up samba, and don’t have a separate Windows computer to test with (I’d have to reboot my Linux host…which means I’d lose my Jetson router), so it is difficult to even attempt to reproduce this. Someone who has a Windows machine with network neighborhood would be required to test this.

Hi, thanks for your answer.

I have uninstalled samba on my tx1 and performed the same test using iperf.
Running

iperf -s

on the tx1 and

iperf -c [TX1_IP] -d

on another linux machine crashes the r8152 driver with the same kind of dmesg output, if the ethernet link is in full gigabit mode.
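To tell quickly whether a given iperf run tripped the watchdog, the kernel log can be filtered for the timeout message shown in the first post. This is only a sketch: count_tx_timeouts is a hypothetical helper name, and the pattern is taken from the single dmesg line quoted above, so other kernels may word the message differently.

```shell
#!/bin/sh
# count_tx_timeouts: hypothetical helper, reads kernel log text on stdin and
# prints how many NETDEV WATCHDOG transmit timeouts were logged for one
# interface (default eth0). Prints 0 when there are none.
count_tx_timeouts() {
    iface="${1:-eth0}"
    # In a basic regular expression the parentheses are literal characters.
    grep -c "NETDEV WATCHDOG: ${iface} (.*): transmit queue [0-9]* timed out" || true
}

# Typical use on the tx1 after the iperf pair has run:
#   dmesg | count_tx_timeouts eth0
```

A non-zero count after a bridged run and a zero count after a non-bridged run would match what is described in this thread.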

This rules out samba, but there is still a question of whether this is scheduler-related or purely r8152. This does simplify testing though.

I tried the iperf test a few times, but could not get a crash. The test perhaps completes too soon if there is a requirement for some interaction…which tends to point more at the scheduler and not the r8152. I did not use anything related to bridging; this was a purely standard setup (running at gigabit speed on both ends). Under your iperf test, was there anything special about the network setup other than being gigabit speed?

The network topology is simple:
linuxpc - gigabit router - tx1

Running iperf without bridging does not crash. Creating a new bridge interface with only eth0 as its bridge “member” (although I also tried to add wlan0 to the mix) and running the test again crashes the driver on my end.

Normally a bridge would be placed between two nodes, though one side of the bridge can be given a second address to be used for non-bridge purposes. How are you setting up your bridge (details for reproducing the issue are needed), and is one side of the bridge set up with an address in addition to bridging mode? I have seen issues in the past with bridging mode, though it was unrelated to the specific network driver.

The initial idea is to bridge wlan0 and eth0.
My initial setup was:
/etc/network/interfaces

auto lo br0
iface lo inet loopback

allow-hotplug wlan0
iface wlan0 inet manual

allow-hotplug eth0
iface eth0 inet manual

iface br0 inet static
	bridge_ports wlan0 eth0
	address 192.168.0.8
	network 192.168.0.0
	netmask 255.255.255.0
	gateway 192.168.0.100
	dns-nameservers 8.8.8.8 8.8.4.4

And hostapd is running on wlan0, and set to work with the bridge br0.

In this configuration, wlan0 and eth0 share the same network. Connecting and transferring via wlan0 works well, but doing so via eth0 will result in that crash.

I then removed wlan0 from the bridge (the configuration is now the same as the one in my first post), and the problem is still there.

I have bridged like this in the past on my openwrt routers, with one or two interfaces, and it worked well.
There may be a better way to bridge the two interfaces though…
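For anyone who wants to reproduce the single-member case by hand rather than through /etc/network/interfaces, the steps might look roughly like the sketch below. It assumes bridge-utils (brctl) and iproute2 (ip) are installed and reuses the addresses from the configuration above; run and setup_bridge are made-up helper names. By default it only prints the commands; setting APPLY=1 runs them with sudo, and flushing the address on eth0 will drop any session that goes over that interface.

```shell
#!/bin/sh
# run: hypothetical wrapper, prints the command unless APPLY=1 is set.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        sudo "$@"
    else
        echo "$*"
    fi
}

# setup_bridge BRIDGE PORT ADDR/PREFIX GATEWAY
# Creates a bridge with a single member and moves the IP setup onto it.
setup_bridge() {
    br="$1"; port="$2"; addr="$3"; gw="$4"
    run brctl addbr "$br"                      # create the bridge device
    run brctl addif "$br" "$port"              # enslave the wired port
    run ip addr flush dev "$port"              # the port itself carries no address
    run ip addr add "$addr" dev "$br"          # the address lives on the bridge
    run ip link set "$port" up
    run ip link set "$br" up
    run ip route add default via "$gw" dev "$br"
}

# Values below are the ones from the configuration posted above.
setup_bridge br0 eth0 192.168.0.8/24 192.168.0.100
```

Printing first makes it easy to paste the exact sequence into the thread if the crash shows up.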

I am not a bridge expert, but your config looks ok to me. In the past I’ve seen kernel OOPS from a bridge configuration which looked ok, but turned out that some form of traffic caused the bridge to essentially go recursive (this was on 32-bit ARM hardware using a Tegra 3). In this latter case there was an assigned address such that traffic went through and then routed back to the assigned address. The bridge should have handled it more gracefully, but I’m thinking fixes may not exist in the 3.10 kernels (I’m sure it was fixed in the 4.x kernels). I can’t really say that this is the cause, but what is the output of “route” before and after bringing up the bridge (I’m assuming there is some case of being able to bring up the bridge without immediate error, e.g., with the ethernet cable unplugged or wlan0 down)? On this older Tegra 3 system the fix was a configuration change, though I don’t remember what the actual fix was.

Here is my route before the bridge is up, when eth0 is configured in a simple network:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         192.168.0.100   0.0.0.0         UG    0      0        0 eth0
link-local      *               255.255.0.0     U     1000   0        0 eth0
192.168.0.0     *               255.255.255.0   U     0      0        0 eth0

After the bridge is up:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         192.168.0.100   0.0.0.0         UG    0      0        0 br0
link-local      *               255.255.0.0     U     1000   0        0 br0
192.168.0.0     *               255.255.255.0   U     0      0        0 br0

The thing is, the bridge works for a bit when the data flow is low through eth0 (I have internet access through eth0 on my tx1 when the bridge is up)
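One thing that is easy to miss in the tables above is a leftover route that still points at eth0 once the bridge is up. A small filter can list which interfaces the routing table actually uses. This is a sketch that assumes the classic net-tools "route" layout shown above (one title line, one header line, then eight columns per route); route_ifaces is a hypothetical helper name.

```shell
#!/bin/sh
# route_ifaces: hypothetical helper, reads "route" output on stdin and prints
# each interface that appears in the Iface column once, sorted.
route_ifaces() {
    # Skip the title and header lines, then print the last column of each row.
    awk 'NR > 2 && NF >= 8 { print $NF }' | sort -u
}

# Typical use on the tx1:
#   route | route_ifaces
```

With the bridge up, a healthy table would be expected to list only br0; seeing eth0 as well would suggest the address was not fully moved to the bridge.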

I am wondering if spanning tree protocol is doing something you don’t expect. See if it is enabled (post this if you can):

sudo brctl show

If STP is enabled, try disabling it (assumes bridge is br0…it’s actually best to do this before adding an interface to the bridge):

sudo brctl stp br0 off

Also, what is the output of ifconfig at this point?

brctl show:

bridge name     bridge id               STP enabled     interfaces
br0             8000.00044b66cca1       no              eth0
                                                        wlan0

STP is therefore disabled.
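The same check can be scripted so it is easy to repeat after each configuration change. The sketch below parses the "brctl show" table in the layout posted above; stp_enabled is a made-up helper name. On kernels that expose it, /sys/class/net/br0/bridge/stp_state should report the same thing (0 meaning disabled), though that was not verified on this L4T release.

```shell
#!/bin/sh
# stp_enabled: hypothetical helper, reads "brctl show" output on stdin and
# prints the "STP enabled" column (yes/no) for the named bridge.
# Returns non-zero if the bridge is not listed.
stp_enabled() {
    br="${1:-br0}"
    awk -v br="$br" '
        $1 == br { print $3; found = 1 }
        END { if (!found) exit 1 }
    '
}

# Typical use on the tx1:
#   sudo brctl show | stp_enabled br0
```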

ifconfig shows:

br0       Link encap:Ethernet  HWaddr 00:04:4b:66:cc:a1
          inet addr:192.168.0.8  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::204:4bff:fe66:cca3/64 Scope:Link
          inet6 addr: fd70:e404:88d0:0:a1d4:6f97:c8c0:97ff/64 Scope:Global
          inet6 addr: fd70:e404:88d0:0:204:4bff:fe66:cca3/64 Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:219 errors:0 dropped:0 overruns:0 frame:0
          TX packets:319 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:19997 (19.9 KB)  TX bytes:33386 (33.3 KB)

eth0      Link encap:Ethernet  HWaddr 00:04:4b:66:cc:a3
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:249 errors:0 dropped:0 overruns:0 frame:0
          TX packets:316 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:28249 (28.2 KB)  TX bytes:37630 (37.6 KB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:91 errors:0 dropped:0 overruns:0 frame:0
          TX packets:91 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:7855 (7.8 KB)  TX bytes:7855 (7.8 KB)

wlan0     Link encap:Ethernet  HWaddr 00:04:4b:66:cc:a1
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:156 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:17295 (17.2 KB)

I’m at a bit of a disadvantage here because I don’t have WiFi (I’ve avoided it for many reasons) and can only go through the bridge setup steps, but I’m wondering if the wlan0 entry from “brctl show” is correct. Having an interface which requires hot plug complicates things.

For now I’m going to point you at a bridge configuration article here:
http://www.tldp.org/HOWTO/Ethernet-Bridge-netfilter-HOWTO-3.html
Other than the part about making the effects persistent over reboots, see if the info on setting up routing and forwarding helps. If not, I may attempt to do some wireless setup (under normal circumstances I’d probably chew my own leg off before messing with wireless on a working system :P) to see what happens when bridging WiFi.

I hope you won’t wreck your setup, I would feel bad if you did!
I’ll give this link a try. Removing the WiFi from the bridge and removing the allow-hotplug lines does not make the crash disappear, though. It would seem that the issue is that eth0 does not like(?) being in a bridge, regardless of the other members. The issue is reproducible if eth0 is set as the only member of bridge br0.

I’ve tried this with just eth0 as a member of the bridge, but no crash. I’m not sure what the trigger was on a single node bridge, but more info on the moment of the crash might help.

One thing I’m wondering about is whether two addresses overlap. In your current setup I see br0 is assigned 192.168.0.8. Just for testing purposes, without changing anything else in networking, what happens if you use a different subnet, e.g., 192.168.3.8? I’m trying to intentionally put this bridge into a subnet which nothing else will want to use (or which was previously unused).

Hi,

I am new to this topic. What is the purpose of creating a new bridge interface with only eth0 in it?
Is it for experiment?

The initial purpose was to bridge eth0 and wlan0, but since it crashed, I tried to isolate the source of the problem and noticed that I could reproduce the crash with only eth0 in the bridge.

I am not set up for WiFi and was asking about ways to reproduce the problem when he noticed WiFi did not need to be involved. So far I am unable to reproduce using just the wired side of the bridge.

One important note is that in the past (perhaps also on a 3.x kernel) I had used bridging on armhf (Tegra 3) and generated an OOPS with the same Realtek driver. The root cause turned out to be configuration; the OOPS was simply poor handling of an illegal configuration (the failure should have been handled more gracefully, but it was my fault for using that configuration). What it came down to is that I have multiple wired networks, and if a route exists such that a bridge can feed back to itself and be forced to recursively pass traffic through one node or the bridge as a whole, this would happen. I suspect he is running into an incorrect configuration which has an ungraceful error handling response, but I can’t guarantee it…triggering such a configuration to fail would depend on traffic passing through it (possibly ARP, with no specific user action as the trigger).

Changing subnets does not change anything. I also followed the bridging link you posted, and it still crashes. I will try flashing a freshly generated JetPack filesystem to see if anything I installed could be the cause.

Even if there is a configuration error causing the crash this would still be a legitimate bug…the nature would just differ between error handling issues versus a bug of a valid (non-error) configuration. I just have not been able to reproduce it myself (I’ve only done limited testing), and suspect the bug only shows up with particular combinations of hardware and configuration (I’ve seen similar issues with bridging in the past on armhf). It is worth pursuing why this occurs.