I’m using L4T version 28.2.1. When transferring data from a Jetson TX2 to another Linux PC (or to another Jetson TX2 board) over a VLAN, the TX2 network driver seems to run into a deadlock.
You can easily reproduce this problem with the following steps:
Setup of the target system (e.g. ubuntu):
apt-get install vlan
modify /etc/network/interfaces:
auto enp2s0
iface enp2s0 inet dhcp
auto enp2s0.1234
iface enp2s0.1234 inet static
address 172.31.254.1
netmask 255.255.255.0
reboot
On tx2 (source):
apt-get install vlan
modify /etc/network/interfaces:
auto eth0
iface eth0 inet dhcp
auto eth0.1234
iface eth0.1234 inet static
address 172.31.254.2
netmask 255.255.255.0
=> After some seconds or minutes the scp command hangs and the TX2 can no longer be pinged on either the eth0 or the eth0.1234 interface. On the debug UART of the TX2 I can’t see any error message (via dmesg).
After an ifconfig eth0 down followed by an ifconfig eth0 up, the network works again.
Is this a bug in the TX2 eqos Ethernet driver? How can this be fixed?
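Until the root cause is known, the down/up workaround described above could be automated. The following is only a sketch of that idea, not a fix: the function names, the failure threshold and the polling interval are made up for illustration, it assumes the Linux `ping` and `ifconfig` tools are installed, and it has to run as root.

```python
import subprocess
import time


def peer_reachable(address, run=subprocess.run):
    """Send a single ping with a 2 s timeout; True if it was answered."""
    result = run(["ping", "-c", "1", "-W", "2", address],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def bounce_interface(iface, run=subprocess.run):
    """Apply the workaround from the post: interface down, then up."""
    run(["ifconfig", iface, "down"], check=True)
    run(["ifconfig", iface, "up"], check=True)


def check_once(address, iface, failures, max_failures=3, run=subprocess.run):
    """Do one probe and return the updated count of consecutive failures.

    After max_failures missed pings in a row the interface is bounced
    and the counter starts again from zero.
    """
    if peer_reachable(address, run):
        return 0
    failures += 1
    if failures >= max_failures:
        bounce_interface(iface, run)
        return 0
    return failures


def run_forever(address, iface, interval=5.0):
    """Probe the peer every `interval` seconds (hypothetical default)."""
    failures = 0
    while True:
        failures = check_once(address, iface, failures)
        time.sleep(interval)
```

With the addresses from the setup above, `run_forever("172.31.254.1", "eth0")` on the TX2 would probe the target over the VLAN and bounce eth0 after three missed pings in a row. The `run` parameter exists only so the logic can be exercised without touching the real network.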
I know nothing about vlan, but I’ll suggest that you go to the serial console and monitor “dmesg --follow” before starting. Then start your test and see if anything shows up in dmesg.
Just prior to your test you might also save a copy of the output from:
I have successfully run the same test for more than an hour on a Jetson TK1 evaluation board (with L4T 21.7.0; modified kernel config: CONFIG_MACVLAN=y and CONFIG_VLAN_8021Q=m). On a TX2 board, however, the test gets stuck within seconds or minutes.
On the serial console I don’t get any message when the error occurs. After the error, a route command may take up to 10 seconds to return. Both route and ifconfig produce the same output as before the error.
While the test is running, the amount transmitted (TX) on eth0 is about 5x more than on eth0.1234:
The test is also successful when the VLAN runs over an Intel network card (82574L chipset) attached to the TX2. So it seems to me that there is a bug in the eqos hardware or its driver.
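The TX comparison between eth0 and eth0.1234 mentioned above can also be taken from sysfs instead of ifconfig. A minimal sketch, assuming the byte counters are the quantity of interest (the post does not say whether bytes or packets were compared); the function names are made up for illustration. Since tagged frames leave through the parent device, eth0 would be expected to count the VLAN traffic as well, so a ratio above 1 is not suspicious on its own.

```python
from pathlib import Path


def read_tx_bytes(iface, sysfs_root="/sys/class/net"):
    """Return the kernel's transmitted-bytes counter for one interface."""
    counter = Path(sysfs_root, iface, "statistics", "tx_bytes")
    return int(counter.read_text())


def tx_ratio(parent, vlan, sysfs_root="/sys/class/net"):
    """Return parent TX bytes divided by VLAN TX bytes, or None if the
    VLAN interface has not transmitted anything yet."""
    vlan_tx = read_tx_bytes(vlan, sysfs_root)
    if vlan_tx == 0:
        return None
    return read_tx_bytes(parent, sysfs_root) / vlan_tx


if __name__ == "__main__":
    try:
        print(tx_ratio("eth0", "eth0.1234"))
    except OSError as exc:
        # The interfaces from the post may not exist on this machine.
        print("could not read counters:", exc)
```

Sampling the ratio once before and once after a transfer, and comparing the differences, says more than a single reading, because the counters accumulate from boot.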
Do keep in mind I am not familiar with vlan setup, and in particular I’m not sure about the “eth0.1234” syntax.
What I do know is that the above ifconfig output showed as normal operation without any kind of conflict, but “route” should not take 10 seconds…this would tend to imply a timeout from some sort of configuration error. What is the actual output from “route”? It wouldn’t be unusual for a bad route setup to cause the equivalent of a lockup. I have seen something very similar when a bridge was set to send output from one side back to itself in an infinite loop.
The number following eth0 or enp2s0 is the VLAN ID. It can be any number between 1 and 4094 and must match the ID used on the other network adapter for both to be in the same VLAN.
Here is the output of route:
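To make the naming rule concrete, here is a small sketch that splits a name such as eth0.1234 into parent device and VLAN ID and checks the 1 to 4094 range. The helper names are hypothetical, and it only covers the "parent.id" naming used in this thread.

```python
def parse_vlan_ifname(name):
    """Split 'parent.vid' into (parent, vid); raise ValueError if malformed."""
    parent, sep, vid_text = name.rpartition(".")
    if not sep or not parent or not (vid_text.isascii() and vid_text.isdigit()):
        raise ValueError(f"not a VLAN sub-interface name: {name!r}")
    vid = int(vid_text)
    if not 1 <= vid <= 4094:
        raise ValueError(f"VLAN ID {vid} is outside 1..4094")
    return parent, vid


def same_vlan(name_a, name_b):
    """True if both interface names carry the same VLAN ID."""
    return parse_vlan_ifname(name_a)[1] == parse_vlan_ifname(name_b)[1]


if __name__ == "__main__":
    print(parse_vlan_ifname("eth0.1234"))
    print(same_vlan("eth0.1234", "enp2s0.1234"))
```

For the setup in this thread, eth0.1234 on the TX2 and enp2s0.1234 on the PC both yield ID 1234, so the two ends are in the same VLAN even though the parent device names differ.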
nvidia@tegra-ubuntu:~$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         proxyname       0.0.0.0         UG    0      0        0 eth0
link-local      *               255.255.0.0     U     1000   0        0 l4tbr0
172.17.0.0      *               255.255.0.0     U     0      0        0 docker0
172.31.254.0    *               255.255.255.0   U     0      0        0 eth0.1234
192.168.0.0     *               255.255.252.0   U     0      0        0 eth0
192.168.55.0    *               255.255.255.0   U     0      0        0 l4tbr0
After the test/error the TX2 can no longer resolve the name of the proxy, and I think this is why route takes up to about 10 seconds. In this case the output of route is the same, except that the name of the proxy is replaced by its IP address.
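If name resolution is what makes route slow, the table can be inspected without any lookups: route -n prints numeric addresses only, and the same data is available in /proc/net/route. The sketch below decodes that file; it assumes a little-endian host (true for the TX2 and for x86 PCs), and the function names are made up for illustration.

```python
import socket
import struct


def hex_to_ipv4(hex_text):
    """Decode one hex address field of /proc/net/route.

    Assumes a little-endian host, where the kernel prints the address
    bytes in reversed order.
    """
    return socket.inet_ntoa(struct.pack("<I", int(hex_text, 16)))


def parse_proc_net_route(text):
    """Return one dict per route, using numeric addresses only (no DNS)."""
    routes = []
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 8:
            continue
        routes.append({
            "iface": fields[0],
            "destination": hex_to_ipv4(fields[1]),
            "gateway": hex_to_ipv4(fields[2]),
            "mask": hex_to_ipv4(fields[7]),
        })
    return routes


if __name__ == "__main__":
    try:
        with open("/proc/net/route") as handle:
            for route in parse_proc_net_route(handle.read()):
                print(route)
    except OSError as exc:
        print("could not read /proc/net/route:", exc)
```

If this returns immediately while plain route still takes about 10 seconds, that would support the idea that the delay comes from the failed name lookup rather than from the routing table itself.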
I lack experience with VLANs, so there isn’t a lot I can say other than that the ifconfig and route output seems ok and without conflict. Perhaps someone knowing more about VLANs can comment on the performance side.