VLAN problem with TX2

Hello

I’m using L4T 28.2.1. When transferring data from a Jetson TX2 to another Linux PC (or to another Jetson TX2 board) over a VLAN, the TX2 network driver seems to run into a deadlock.

You can easily reproduce the problem with the following steps:

Setup of the target system (e.g. Ubuntu):

  • apt-get install vlan
  • modify /etc/network/interfaces:
    auto enp2s0
    iface enp2s0 inet dhcp
    
    auto enp2s0.1234
    iface enp2s0.1234 inet static
      address 172.31.254.1
      netmask 255.255.255.0
    
  • reboot

On the TX2 (source):

  • apt-get install vlan
  • modify /etc/network/interfaces:
    auto eth0
    iface eth0 inet dhcp
    
    auto eth0.1234
    iface eth0.1234 inet static
      address 172.31.254.2
      netmask 255.255.255.0
    
  • reboot
  • execute the following commands via ssh (over eth0):
    dd if=/dev/urandom of=/tmp/test.dat bs=1M count=100
    i=0; while true; do i=$((i + 1)); echo "$i"; sshpass -p 'password' scp /tmp/test.dat asdf@172.31.254.1:/tmp; done
    

=> After some seconds or minutes the scp command hangs and the TX2 can no longer be pinged on either the eth0 or the eth0.1234 interface. On the debug UART of the TX2 I don’t see any error message (checked via dmesg).
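
To make the hang easier to spot than with an endless loop, the transfer can be wrapped in a timeout. This is only a sketch: the function name repro_loop and the 120 second limit are assumptions, not part of the original test.

```shell
# Run a command repeatedly and stop at the first iteration that hangs or fails.
# Usage: repro_loop MAX_ITERATIONS TIMEOUT_SECONDS COMMAND [ARGS...]
# (repro_loop is a hypothetical helper name.)
repro_loop() {
    local max limit i rc
    max=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$max" ]; do
        rc=0
        # coreutils timeout exits with status 124 when the limit is hit
        timeout "$limit" "$@" || rc=$?
        if [ "$rc" -eq 124 ]; then
            echo "iteration $i: timed out after ${limit}s"
            return 1
        elif [ "$rc" -ne 0 ]; then
            echo "iteration $i: failed with exit status $rc"
            return 2
        fi
        echo "iteration $i: ok"
        i=$((i + 1))
    done
}

# Example with the values from the steps above (the 120 s limit is a guess):
# repro_loop 1000 120 sshpass -p 'password' scp /tmp/test.dat asdf@172.31.254.1:/tmp
```

The iteration number printed at the timeout gives a rough measure of how long the link survived, which helps when comparing kernels or boards.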

After taking eth0 down and bringing it up again with ifconfig, the network works again.
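
Until the cause is fixed, this recovery step could be automated. The following is a workaround sketch, not a fix. It assumes that the TX2 also cannot ping the peer while it is stuck, that it runs as root, and recover_link is a hypothetical name.

```shell
# Restart the interface when the VLAN peer stops answering pings.
# Usage: recover_link INTERFACE PEER_ADDRESS   (needs root for ifconfig)
recover_link() {
    local iface peer
    iface=$1
    peer=$2
    # One ping with a 2 second reply timeout; nothing to do if it succeeds
    if ping -c 1 -W 2 "$peer" > /dev/null 2>&1; then
        return 0
    fi
    echo "peer $peer unreachable, restarting $iface" >&2
    ifconfig "$iface" down
    ifconfig "$iface" up
}

# Example: check every 10 seconds (addresses from the setup above):
# while sleep 10; do recover_link eth0 172.31.254.1; done
```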

Is this a bug of the TX2 eqos ethernet driver? How can this be fixed?

Regards
Werner

I know nothing about vlan, but I’ll suggest that you go to the serial console and monitor “dmesg --follow” before starting. Then start your test and see if anything shows up in dmesg.

Just prior to your test you might also save a copy of the output from:

ifconfig
route

I have successfully run the same test for more than an hour on a Jetson TK1 evaluation board (with L4T 21.7.0; modified kernel config: CONFIG_MACVLAN=y and CONFIG_VLAN_8021Q=m). On a TX2 board, however, the test hangs within seconds or minutes.

On the serial console I don’t get any message when the error occurs. After the error, a route command may take up to 10 seconds to return. Both route and ifconfig return the same result as before the error.
While the test is running, the TX byte count of eth0 is about 5x higher than that of eth0.1234:

nvidia@tegra-ubuntu:~$ ifconfig
eth0      Link encap:Ethernet  HWaddr xxx
          inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.252.0
          inet6 addr: xxx Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:345802 errors:0 dropped:0 overruns:0 frame:0
          TX packets:331532 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:24900932 (24.9 MB)  TX bytes:26799902726 (26.7 GB)
          Interrupt:42

eth0.1234 Link encap:Ethernet  HWaddr xxx
          inet addr:172.31.254.2  Bcast:172.31.254.255  Mask:255.255.255.0
          inet6 addr: xxx Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:336189 errors:0 dropped:0 overruns:0 frame:0
          TX packets:331380 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:19208608 (19.2 MB)  TX bytes:5383352516 (5.3 GB)

On the Jetson TK1 board, by contrast, the TX byte count is nearly the same for eth0 and eth0.1234.
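
The "about 5x" figure can be checked with a small calculation. This is a sketch: tx_ratio is a hypothetical helper, and the two byte counts are copied from the TX2 ifconfig output above.

```shell
# Print the ratio of two byte counters with two decimal places.
# Usage: tx_ratio PHYSICAL_TX_BYTES VLAN_TX_BYTES
tx_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (b == 0) { print "n/a"; exit }
        printf "%.2f\n", a / b
    }'
}

# Counters from the ifconfig output above (eth0 vs. eth0.1234)
tx_ratio 26799902726 5383352516    # prints 4.98

# The live counters can be read from sysfs instead of parsing ifconfig:
# tx_ratio "$(cat /sys/class/net/eth0/statistics/tx_bytes)" \
#          "$(cat /sys/class/net/eth0.1234/statistics/tx_bytes)"
```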

The test also succeeds when the VLAN is run over an Intel network card (Intel 82574L chipset) on the TX2. So it seems to me that there is a bug in the eqos hardware or its driver.

Do keep in mind I am not familiar with vlan setup, and in particular I’m not sure about the “eth0.1234” syntax.

What I do know is that the above ifconfig output shows normal operation without any kind of conflict, but “route” should not take 10 seconds…this would tend to imply a timeout from some sort of configuration error. What is the actual output from “route”? It wouldn’t be unusual for a bad route setup to cause the equivalent of a lockup. I have seen something very similar when a bridge was set to send output from one side back to itself in an infinite loop.

The number after eth0 or enp2s0 is the VLAN ID. It can be any number between 1 and 4094 and must match the ID on the other network adapter for both to be in the same VLAN.

Here is the output of route:
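
For illustration, the same naming scheme can be used to create the interface by hand with iproute2 instead of /etc/network/interfaces. This is a sketch: valid_vlan_id and add_vlan are hypothetical helper names, and add_vlan needs root.

```shell
# Succeed only if the argument is a decimal number in the range 1..4094.
valid_vlan_id() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 4094 ]
}

# Create PARENT.VID (for example eth0.1234) on top of PARENT.
# Usage: add_vlan PARENT_INTERFACE VLAN_ID
add_vlan() {
    local parent vid
    parent=$1
    vid=$2
    valid_vlan_id "$vid" || { echo "invalid VLAN ID: $vid" >&2; return 1; }
    ip link add link "$parent" name "$parent.$vid" type vlan id "$vid"
}

# Example: add_vlan eth0 1234
```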

nvidia@tegra-ubuntu:~$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         proxyname       0.0.0.0         UG    0      0        0 eth0
link-local      *               255.255.0.0     U     1000   0        0 l4tbr0
172.17.0.0      *               255.255.0.0     U     0      0        0 docker0
172.31.254.0    *               255.255.255.0   U     0      0        0 eth0.1234
192.168.0.0     *               255.255.252.0   U     0      0        0 eth0
192.168.55.0    *               255.255.255.0   U     0      0        0 l4tbr0

After the test/error the TX2 can no longer resolve the name of the proxy, and I think this is why route takes up to about 10 seconds. In this case the output of route is the same, except that the proxy name is replaced by its IP address.

I have now tried applying two VLAN-related commits from the L31.1 (Xavier) kernel to the L28.2.1 (TX2) kernel:
http://nv-tegra.nvidia.com/gitweb/?p=linux-4.9.git;a=commit;h=75cceb81f7ab45b606687f797bb50e0f5519a07f
http://nv-tegra.nvidia.com/gitweb/?p=linux-4.9.git;a=commit;h=1cf3b8b7e2186394f3d43e5cfa36838ea22892ca

The test now runs successfully. But in the ifconfig output the TX byte count of the physical adapter is about 5x higher than that of the VLAN adapter:

eth0      Link encap:Ethernet  HWaddr xxx
          inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.252.0
          inet6 addr: xxx Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9154481 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9408588 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:661071439 (661.0 MB)  TX bytes:779149769384 (779.1 GB)
          Interrupt:42

eth0.1234 Link encap:Ethernet  HWaddr xxx
          inet addr:172.31.254.1  Bcast:172.31.254.255  Mask:255.255.255.0
          inet6 addr: xxx Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9033550 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9405546 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:519284133 (519.2 MB)  TX bytes:162756958643 (162.7 GB)
...

@nvidia: can you please fix this in the next L4T TX2 release?

I lack experience with VLANs, so there isn’t a lot I can say other than that the ifconfig and route output seems ok and without conflict. Perhaps someone knowing more about VLANs can comment on the performance side.