Hi all,
I’m working on optimising the throughput of the 10GBASE-T1 link between an Orin p3663-a01 board and a separate x86_64 system. This is a 3rd-party dual-Orin system, but the question relates to the nvethernet driver, which is just the standard DRIVE version. I’ve come across an odd issue: a simple iperf3 test between the Orin and the other host produces ~5 Gbit/s when sending from the Orin to the host:
$ iperf3 -c 192.168.1.2
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
---
[  5]   0.00-10.00  sec  5.59 GBytes  4.80 Gbits/sec   87            sender
[  5]   0.00-10.00  sec  5.58 GBytes  4.79 Gbits/sec                 receiver
But it gives ~8.5 Gbit/s when receiving from the host to the Orin:
$ iperf3 -c 192.168.1.2 -R
[ ID] Interval           Transfer     Bitrate         Retr
---
[  5]   0.00-10.00  sec  10.0 GBytes  8.59 Gbits/sec    1            sender
[  5]   0.00-10.00  sec  10.0 GBytes  8.59 Gbits/sec                 receiver
The cause seems to be in the queue allocation. The nvethernet driver creates 10 rx and 10 tx queues for the interface, but only the first tx queue is being used:
$ ethtool -S mgbe0_0 | grep q_.*_pkt
q_tx_pkt_n[0]: 57169958
q_tx_pkt_n[1]: 0
q_tx_pkt_n[2]: 9
q_tx_pkt_n[3]: 0
q_tx_pkt_n[4]: 0
q_tx_pkt_n[5]: 0
q_tx_pkt_n[6]: 4082
q_tx_pkt_n[7]: 22
q_tx_pkt_n[8]: 0
q_tx_pkt_n[9]: 0
q_rx_pkt_n[0]: 5431386
q_rx_pkt_n[1]: 9402526
q_rx_pkt_n[2]: 1139638
q_rx_pkt_n[3]: 25820203
q_rx_pkt_n[4]: 1210313
q_rx_pkt_n[5]: 7718919
q_rx_pkt_n[6]: 16913829
q_rx_pkt_n[7]: 14053080
q_rx_pkt_n[8]: 2834541
q_rx_pkt_n[9]: 18741537
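For what it’s worth, my understanding is that the tx queue a packet lands on is picked either by the driver’s own queue-selection hook or by XPS, so I’ve been poking at the per-queue CPU masks under sysfs. Something along these lines (paths assume the standard Linux sysfs layout for mgbe0_0; the second loop is only a rough sketch of spreading the 10 queues one per CPU, not something I’ve verified on this board):

# List the tx queues the kernel actually exposes and their current XPS CPU masks
for q in /sys/class/net/mgbe0_0/queues/tx-*; do
    echo "$q: $(cat $q/xps_cpus)"
done

# Rough sketch: steer transmits originating on CPU N to tx queue N
# (the value written is a hex CPU bitmask; adjust before trying this)
for i in $(seq 0 9); do
    printf '%x' $((1 << i)) | sudo tee /sys/class/net/mgbe0_0/queues/tx-$i/xps_cpus
done

If the driver overrides queue selection internally, I assume the XPS masks would simply be ignored, which is part of what I’m trying to confirm.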
Running parallel iperf3 streams with -P10 doesn’t appear to make any difference to this behaviour. On an otherwise idle system, top shows a single CPU saturated with system and interrupt load while transmitting:
top - 16:20:18 up 1 day, 2:31, 2 users, load average: 0.74, 1.23, 1.33
Tasks: 724 total, 2 running, 722 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.0 us, 19.2 sy, 0.0 ni, 12.9 id, 0.0 wa, 1.2 hi, 66.8 si, 0.0 st
%Cpu1 : 0.0 us, 2.7 sy, 0.0 ni, 97.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
[other 10 cores idle]
So the tx bottleneck appears to be interrupt overload on a single core from that one queue. Is this expected behaviour from the driver and, if not, do I need to change the driver to make full use of the 10 queues?
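I also haven’t ruled out plain IRQ affinity. In case it’s relevant, this is roughly how I’d check which CPUs the ethernet interrupts land on and move one to another core (the IRQ number below is a placeholder, and irqbalance would need to be stopped so it doesn’t rewrite the mask):

# See which IRQ lines belong to the interface and how they are spread across CPUs
# (adjust the pattern if the driver registers its IRQs under a different label)
grep -i mgbe /proc/interrupts

# Rough sketch: pin one IRQ to CPU 2; the value is a hex CPU bitmask
echo 4 | sudo tee /proc/irq/<irq>/smp_affinity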
Some details:
$ ethtool -i mgbe0_0
driver: nvethernet
version: 5.10.120-rt70-tegra
firmware-version:
expansion-rom-version:
bus-info: 6810000.ethernet
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
$ uname -a
Linux hostname 5.10.120-rt70-tegra #1 SMP PREEMPT RT Fri May 26 11:33:37 CST 2023 aarch64 aarch64 aarch64 GNU/Linux
$ ethtool -k mgbe0_0 | grep -v fixed
Features for mgbe0_0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ipv6: on
scatter-gather: on
tx-scatter-gather: on
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-mangleid-segmentation: off
generic-segmentation-offload: on
generic-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
receive-hashing: on
rx-vlan-filter: on
tx-udp-segmentation: on
tx-nocache-copy: off
rx-gro-list: off
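If it helps, I’m happy to post the channel and ring configuration as well; assuming the driver implements the corresponding ethtool ops, these should be the queries:

$ ethtool -l mgbe0_0
$ ethtool -g mgbe0_0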
Please provide the following info (tick the boxes after creating this topic):
Software Version
[/] DRIVE OS 6.0.5
Target Operating System
[/] Linux
Hardware Platform
[/] other
SDK Manager Version
[/] other
Host Machine Version
[/] other (Ubuntu 20.04)