Networking performance issue

We are encountering Ethernet performance issues, especially since we updated our TX2 boards to JetPack 4.4.1.
Our network consists of a fleet of TX2 boards.

To isolate the problem, we ran a test with a TX2 on its evaluation board, connected to a laptop through a USB-to-Gigabit-Ethernet adapter plugged into the laptop.

We saw strange behavior on our network, so we began to investigate.
To make the issue visible, we are using the iperf tool.
To validate our tests, we repeated them on many TX2 boards.

First of all, we connected our laptop, using the same adapter, to a Linux computer. All those tests were fine: we got very close to the theoretical Gigabit limit (~930 Mb/s).

We used the NVIDIA graphical tool to set up the TX2 from scratch.

We then finalized the TX2 installation with these commands:

sudo apt update
sudo apt upgrade
sudo apt autoremove

We then updated the network configuration by editing the file /etc/network/interfaces:

# interfaces(5) file used by ifup(8) and ifdown(8)
# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
allow-hotplug eth0
iface eth0 inet static
        address 172.16.150.1
        netmask 255.255.0.0

After that, we were able to run the iperf tests.

The TX2 hosts the iperf server, started with the command:

> iperf -s -p 10000

and we start the client on the laptop:

> iperf -c 172.16.150.1 -p 10000 -t 20 -i 1 -r

Here are our results:

------------------------------------------------------------
Server listening on TCP port 10000
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 172.16.150.1, TCP port 10000
TCP window size:  512 KByte (default)
------------------------------------------------------------
[  4] local 172.16.100.100 port 56425 connected with 172.16.150.1 port 10000
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 1.0 sec   110 MBytes   922 Mbits/sec
[  4]  1.0- 2.0 sec   111 MBytes   930 Mbits/sec
[  4]  2.0- 3.0 sec   110 MBytes   923 Mbits/sec
[  4]  3.0- 4.0 sec   110 MBytes   923 Mbits/sec
[  4]  4.0- 5.0 sec   110 MBytes   924 Mbits/sec
[  4]  5.0- 6.0 sec   111 MBytes   930 Mbits/sec
[  4]  6.0- 7.0 sec   111 MBytes   928 Mbits/sec
[  4]  7.0- 8.0 sec   110 MBytes   924 Mbits/sec
[  4]  8.0- 9.0 sec   110 MBytes   922 Mbits/sec
[  4]  9.0-10.0 sec   111 MBytes   928 Mbits/sec
[  4] 10.0-11.0 sec   110 MBytes   926 Mbits/sec
[  4] 11.0-12.0 sec   110 MBytes   927 Mbits/sec
[  4] 12.0-13.0 sec   110 MBytes   920 Mbits/sec
[  4] 13.0-14.0 sec   110 MBytes   925 Mbits/sec
[  4] 14.0-15.0 sec   111 MBytes   929 Mbits/sec
[  4] 15.0-16.0 sec   111 MBytes   932 Mbits/sec
[  4] 16.0-17.0 sec   110 MBytes   926 Mbits/sec
[  4] 17.0-18.0 sec   110 MBytes   923 Mbits/sec
[  4] 18.0-19.0 sec   110 MBytes   923 Mbits/sec
[  4] 19.0-20.0 sec   111 MBytes   928 Mbits/sec
[  4]  0.0-20.0 sec  2.15 GBytes   925 Mbits/sec
[  4] local 172.16.100.100 port 10000 connected with 172.16.150.1 port 60104
[  4]  0.0- 1.0 sec  69.0 MBytes   579 Mbits/sec
[  4]  1.0- 2.0 sec  9.88 MBytes  82.9 Mbits/sec
[  4]  2.0- 3.0 sec  13.8 MBytes   116 Mbits/sec
[  4]  3.0- 4.0 sec  9.79 MBytes  82.1 Mbits/sec
[  4]  4.0- 5.0 sec  9.98 MBytes  83.7 Mbits/sec
[  4]  5.0- 6.0 sec  10.4 MBytes  87.2 Mbits/sec
[  4]  6.0- 7.0 sec  13.6 MBytes   114 Mbits/sec
[  4]  7.0- 8.0 sec  12.0 MBytes   101 Mbits/sec
[  4]  8.0- 9.0 sec  10.0 MBytes  83.9 Mbits/sec
[  4]  9.0-10.0 sec  9.93 MBytes  83.3 Mbits/sec
[  4] 10.0-11.0 sec  16.5 MBytes   138 Mbits/sec
[  4] 11.0-12.0 sec  11.7 MBytes  98.4 Mbits/sec
[  4] 12.0-13.0 sec  18.4 MBytes   155 Mbits/sec
[  4] 13.0-14.0 sec  10.1 MBytes  84.6 Mbits/sec
[  4] 14.0-15.0 sec  15.3 MBytes   129 Mbits/sec
[  4] 15.0-16.0 sec  19.4 MBytes   163 Mbits/sec
[  4] 16.0-17.0 sec  10.8 MBytes  90.7 Mbits/sec
[  4] 17.0-18.0 sec  12.1 MBytes   102 Mbits/sec
[  4] 18.0-19.0 sec  10.2 MBytes  85.4 Mbits/sec
[  4] 19.0-20.0 sec  9.73 MBytes  81.6 Mbits/sec
[  4]  0.0-20.4 sec   309 MBytes   127 Mbits/sec
[SUM]  0.0-20.4 sec   378 MBytes   155 Mbits/sec
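
To compare runs like the one above, it can help to count how many one-second intervals fall below a given throughput. The following is only a minimal sketch and was not part of the original tests: `count_slow_intervals` is a hypothetical helper name, and it assumes iperf version 2 output produced with `-i 1`, in the same format as shown above.

```shell
#!/bin/sh
# count_slow_intervals THRESHOLD_MBITS   (hypothetical helper, not an iperf feature)
# Reads iperf (version 2) output on stdin and prints "<slow> <total>":
# the number of 1 s intervals below the threshold and the number of intervals seen.
count_slow_intervals() {
    awk -v limit="$1" '
        /bits\/sec/ && match($0, /[0-9.]+- *[0-9.]+ sec/) {
            # Extract the "start-end sec" part to find the interval length.
            split(substr($0, RSTART, RLENGTH), t, "-")
            if ((t[2] + 0) - (t[1] + 0) > 1.5) next   # summary line, assumes -i 1
            rate = $(NF - 1)
            unit = $NF
            # Normalize the reported bandwidth to Mbits/sec.
            if (unit == "bits/sec")  rate /= 1000000
            if (unit == "Kbits/sec") rate /= 1000
            if (unit == "Gbits/sec") rate *= 1000
            total++
            if (rate < limit) slow++
        }
        END { printf "%d %d\n", slow, total }
    '
}
```

For example, `iperf -c 172.16.150.1 -p 10000 -t 20 -i 1 | count_slow_intervals 100` would print the number of intervals below 100 Mbits/sec followed by the total number of intervals.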

The problem does not occur every time. Sometimes the results are better, but sometimes they are very poor. Sometimes we even observe a throughput of 0 bytes/sec.

The connection is sometimes so poor that SSH sessions are very difficult to establish and, once connected, about one character per second is displayed on screen.

To illustrate, we reproduced a 0 Bytes/sec case:

------------------------------------------------------------
Server listening on TCP port 10000
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 172.16.150.1, TCP port 10000
TCP window size:  512 KByte (default)
------------------------------------------------------------
[  4] local 172.16.100.100 port 58168 connected with 172.16.150.1 port 10000
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 1.0 sec   512 KBytes  4.19 Mbits/sec
[  4]  1.0- 2.0 sec  0.00 Bytes  0.00 bits/sec
[  4]  2.0- 3.0 sec  0.00 Bytes  0.00 bits/sec
[  4]  3.0- 4.0 sec  0.00 Bytes  0.00 bits/sec
[  4]  4.0- 5.0 sec  0.00 Bytes  0.00 bits/sec
[  4]  5.0- 6.0 sec  0.00 Bytes  0.00 bits/sec
[  4]  6.0- 7.0 sec  0.00 Bytes  0.00 bits/sec
[  4]  7.0- 8.0 sec  0.00 Bytes  0.00 bits/sec
[  4]  8.0- 9.0 sec  0.00 Bytes  0.00 bits/sec
[  4]  9.0-10.0 sec  51.9 MBytes   435 Mbits/sec
[  4] 10.0-11.0 sec  71.2 MBytes   598 Mbits/sec
[  4] 11.0-12.0 sec   111 MBytes   932 Mbits/sec
[  4] 12.0-13.0 sec   109 MBytes   916 Mbits/sec
[  4] 13.0-14.0 sec   109 MBytes   918 Mbits/sec
[  4] 14.0-15.0 sec   110 MBytes   924 Mbits/sec
[  4] 15.0-16.0 sec  80.6 MBytes   676 Mbits/sec
[  4] 16.0-17.0 sec  0.00 Bytes  0.00 bits/sec
[  4] 17.0-18.0 sec  0.00 Bytes  0.00 bits/sec
[  4] 18.0-19.0 sec  0.00 Bytes  0.00 bits/sec
[  4] 19.0-20.0 sec  0.00 Bytes  0.00 bits/sec
[  4]  0.0-20.2 sec   644 MBytes   267 Mbits/sec

We can also reproduce the issue when connecting two TX2 boards directly to each other.

Searching the NVIDIA forum, we found a similar issue that has not been solved:

I would suggest using “iperf3”. Install it with “sudo apt-get install iperf3”. These commands work for us:
iperf3 -s
iperf3 -c 172.16.150.1 -u -b 1000M

Thanks for your answer. Using UDP does seem to improve performance, but when we go back to TCP we get the same problem.

And we need to communicate over TCP, not UDP.

We saw that JetPack 4.5 is available, so we updated to that version and re-ran the iperf tests.

We get the same issue with the new JetPack…

Does anyone have a clue that could help?

iperf -c 172.16.150.1 -p 10000 -t 20 -i 1 -r
------------------------------------------------------------
Server listening on TCP port 10000
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 172.16.150.1, TCP port 10000
TCP window size:  512 KByte (default)
------------------------------------------------------------
[  4] local 172.16.100.100 port 50649 connected with 172.16.150.1 port 10000
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 1.0 sec   109 MBytes   914 Mbits/sec
[  4]  1.0- 2.0 sec   110 MBytes   921 Mbits/sec
[  4]  2.0- 3.0 sec   110 MBytes   925 Mbits/sec
[  4]  3.0- 4.0 sec   111 MBytes   931 Mbits/sec
[  4]  4.0- 5.0 sec   111 MBytes   930 Mbits/sec
[  4]  5.0- 6.0 sec   110 MBytes   923 Mbits/sec
[  4]  6.0- 7.0 sec   110 MBytes   925 Mbits/sec
[  4]  7.0- 8.0 sec   111 MBytes   929 Mbits/sec
[  4]  8.0- 9.0 sec   111 MBytes   929 Mbits/sec
[  4]  9.0-10.0 sec   111 MBytes   928 Mbits/sec
[  4] 10.0-11.0 sec   110 MBytes   923 Mbits/sec
[  4] 11.0-12.0 sec   110 MBytes   924 Mbits/sec
[  4] 12.0-13.0 sec   111 MBytes   929 Mbits/sec
[  4] 13.0-14.0 sec   111 MBytes   929 Mbits/sec
[  4] 14.0-15.0 sec   110 MBytes   925 Mbits/sec
[  4] 15.0-16.0 sec   110 MBytes   924 Mbits/sec
[  4] 16.0-17.0 sec   110 MBytes   924 Mbits/sec
[  4] 17.0-18.0 sec   111 MBytes   930 Mbits/sec
[  4] 18.0-19.0 sec   111 MBytes   930 Mbits/sec
[  4] 19.0-20.0 sec   110 MBytes   924 Mbits/sec
[  4]  0.0-20.0 sec  2.16 GBytes   926 Mbits/sec
[  4] local 172.16.100.100 port 10000 connected with 172.16.150.1 port 46358
[  4]  0.0- 1.0 sec  94.3 MBytes   791 Mbits/sec
[  4]  1.0- 2.0 sec  65.7 MBytes   551 Mbits/sec
[  4]  2.0- 3.0 sec  9.36 MBytes  78.5 Mbits/sec
[  4]  3.0- 4.0 sec  9.54 MBytes  80.0 Mbits/sec
[  4]  4.0- 5.0 sec  8.91 MBytes  74.7 Mbits/sec
[  4]  5.0- 6.0 sec  10.0 MBytes  83.9 Mbits/sec
[  4]  6.0- 7.0 sec  9.38 MBytes  78.7 Mbits/sec
[  4]  7.0- 8.0 sec  8.87 MBytes  74.4 Mbits/sec
[  4]  8.0- 9.0 sec  10.0 MBytes  84.0 Mbits/sec
[  4]  9.0-10.0 sec  9.27 MBytes  77.8 Mbits/sec
[  4] 10.0-11.0 sec  12.3 MBytes   104 Mbits/sec
[  4] 11.0-12.0 sec  9.53 MBytes  79.9 Mbits/sec
[  4] 12.0-13.0 sec  10.8 MBytes  90.3 Mbits/sec
[  4] 13.0-14.0 sec  9.92 MBytes  83.2 Mbits/sec
[  4] 14.0-15.0 sec  8.75 MBytes  73.4 Mbits/sec
[  4] 15.0-16.0 sec  9.20 MBytes  77.2 Mbits/sec
[  4] 16.0-17.0 sec  8.42 MBytes  70.6 Mbits/sec
[  4] 17.0-18.0 sec  8.60 MBytes  72.1 Mbits/sec
[  4] 18.0-19.0 sec  9.59 MBytes  80.5 Mbits/sec
[  4] 19.0-20.0 sec  9.34 MBytes  78.4 Mbits/sec
[  4]  0.0-20.4 sec   337 MBytes   138 Mbits/sec
[SUM]  0.0-20.4 sec   431 MBytes   177 Mbits/sec

Could you test with a host that has a native RJ45 port? We need to clarify whether the adapter has something to do with this issue.

Please also try other adapters if possible. And please tell me which adapter you are using now.

We have used many USB Ethernet adapters and many laptops in order to rule out the USB Ethernet dongles, and all our tests give the same results.

Then could you try an RJ45 port to see if the result is stable?

If any USB adapter can reproduce your issue, I will try to get one and reproduce it later.

What OS is running on the laptop side?

We will try with a “real” RJ45 port and I will come back with the test results.

We tried with Ubuntu and Windows 10.

Nicolas,
“The problem does not occur every time. Sometimes results are better, but sometimes they are very poor…”
=> Did the normal case (925 Mbits/sec) vs. the bad one (155 Mbits/sec) show any pattern during your tests? For instance, with a fresh run, do you always get a good result first and a bad one afterwards, or is it completely random? As Wayne put it, we will try to repro the issue; in the meantime, if you have any further update, feel free to share. Thanks.
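
One way to look for such a pattern is to repeat the same test several times and keep only the final line of each run. This is a sketch under assumptions rather than a method used in this thread: `run_series` is a hypothetical helper, and it assumes the last line printed by the command is the overall summary, which should be the case for the iperf client command from the first post when `-r` is omitted.

```shell
#!/bin/sh
# run_series COUNT CMD [ARGS...]   (hypothetical helper)
# Runs CMD COUNT times and prints the last output line of each run,
# prefixed with the run number, so good and bad runs can be compared.
# The body runs in a subshell so its variables do not leak to the caller.
run_series() (
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'run %d: %s\n' "$i" "$("$@" 2>&1 | tail -n 1)"
        i=$((i + 1))
    done
)
```

For example, `run_series 10 iperf -c 172.16.150.1 -p 10000 -t 20 -i 1` would give ten summary lines, one per run, which makes it easier to see whether bad runs are random or follow a good one.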

So we have made tests with a “real” Ethernet port, and the problem seems to be less reproducible.

We ran another test that matches, as closely as possible, the issue we see today on our system.

We took two TX2 evaluation boards, both with JetPack 4.5, with no apt upgrade run, just the stock install.

We linked the two boards with an Ethernet cable (we tried both a crossover cable and a straight-through cable).

Then we re-ran our iperf test. The results are very poor and reproducible. In this test there is no other computer and no USB Ethernet dongle, so it is easy to reproduce.

We also tried inserting an Ethernet switch between the two TX2s, and the problem disappears.
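
Since the problem disappears with a switch in between, it may be worth comparing what each board negotiates on a direct link versus through the switch. The following is only a sketch: it assumes `ethtool` is installed on the TX2 and that the interface is `eth0` as in the configuration from the first post, and `link_summary` is a hypothetical helper that condenses the standard `ethtool` report into one line.

```shell
#!/bin/sh
# link_summary   (hypothetical helper)
# Reads the output of "ethtool <interface>" on stdin and prints the
# negotiated speed, duplex, auto-negotiation state and link state on one line.
link_summary() {
    awk -F': *' '
        $1 ~ /Speed$/            { speed = $2 }
        $1 ~ /Duplex$/           { duplex = $2 }
        $1 ~ /Auto-negotiation$/ { autoneg = $2 }
        $1 ~ /Link detected$/    { link = $2 }
        END {
            printf "speed=%s duplex=%s autoneg=%s link=%s\n",
                   speed, duplex, autoneg, link
        }
    '
}
```

Running `ethtool eth0 | link_summary` on both boards, once with the direct cable and once through the switch, would show whether the negotiated parameters differ between the two setups.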

But in our architecture there is no switch between the TX2s, so we really need a solution that works with TX2 boards only.

Are you saying that you can reproduce the poor iperf performance when connecting two Jetson TX2s without any USB Ethernet adapter?

Yes, just two TX2s wired together with an Ethernet cable, the iperf server on the first one and the iperf client on the second, using the command lines given in the first post.

In case it helps, we systematically get the following line in dmesg at boot:

> dmesg | grep eqos
[ 1.069411] eqos 2490000.ether_qos: can't get pllrefe_vcoout clk (-2)

Hello,

May I have an estimate of how frequently you see this issue in the case of two TX2s connected port to port?

We tried to repro this, but it does not seem to be reproducible on our side.

We have the problem every time. We have tried with many different TX2 boards.
Did you run the tests with two TX2s as shown in the picture, without any Ethernet switch?

We can repro the issue in the two-TX2 case. We will check this internally.

Please set CONFIG_EQOS_DISABLE_EEE=y in your tegra_defconfig and rebuild the kernel image.

Let’s see if this can improve the result on your side.
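
After rebuilding, it may be useful to confirm that the option actually ended up in the kernel running on both boards. A minimal sketch: `kconfig_state` is a hypothetical helper that reads a kernel configuration on standard input, and the usage below assumes the running kernel exposes its configuration at `/proc/config.gz`, which depends on how it was built.

```shell
#!/bin/sh
# kconfig_state OPTION   (hypothetical helper)
# Reads a kernel configuration on stdin and prints the value of OPTION:
# its value (for example "y"), "n" if it is explicitly not set,
# or "absent" if it does not appear at all.
kconfig_state() {
    awk -v opt="$1" '
        $0 ~ ("^" opt "=")           { sub(/^[^=]*=/, ""); state = $0 }
        $0 == "# " opt " is not set" { state = "n" }
        END { print (state == "" ? "absent" : state) }
    '
}
```

For example, `zcat /proc/config.gz | kconfig_state CONFIG_EQOS_DISABLE_EEE` on each TX2, or `kconfig_state CONFIG_EQOS_DISABLE_EEE < tegra_defconfig` before the build. If the Ethernet driver supports the query, `ethtool --show-eee eth0` may also report the EEE state at runtime.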

We previously tried this option, but only on one of the two connected boards.

We will try it as soon as possible with JetPack 4.5 on both sides.

Did it solve the issue on your boards?

It seems to give better results on our side.