Multiple prx_desc errors on TX2 ethernet

jaroslav.beran · July 18, 2018, 2:57pm

Hello,

I am seeing performance problems and errors when receiving larger amounts of data (more than approx. 80 MB/s) on the integrated ethernet of the TX2 (ether_eqos).

Multiple messages appear in the kernel log:

ubuntu@tegra-ubuntu:~$ dmesg

[...]

[ 5295.529451] 
               prx_desc[00 ffffff800ea7d9a0 154 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529474] 
               prx_desc[00 ffffff800ea7d9b0 155 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529502] 
               prx_desc[00 ffffff800ea7d9c0 156 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529531] 
               prx_desc[00 ffffff800ea7d9d0 157 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529552] 
               prx_desc[00 ffffff800ea7d9e0 158 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529574] 
               prx_desc[00 ffffff800ea7d9f0 159 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529603] 
               prx_desc[00 ffffff800ea7da00 160 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529624] 
               prx_desc[00 ffffff800ea7da10 161 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529649] 
               prx_desc[00 ffffff800ea7da20 162 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529695] 
               prx_desc[00 ffffff800ea7da30 163 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530141] 
               prx_desc[00 ffffff800ea7da40 164 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530169] 
               prx_desc[00 ffffff800ea7da50 165 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530199] 
               prx_desc[00 ffffff800ea7da60 166 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530228] 
               prx_desc[00 ffffff800ea7da70 167 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530255] 
               prx_desc[00 ffffff800ea7da80 168 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530283] 
               prx_desc[00 ffffff800ea7da90 169 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530306] 
               prx_desc[00 ffffff800ea7daa0 170 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530329] 
               prx_desc[00 ffffff800ea7dab0 171 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530397] 
               prx_desc[00 ffffff800ea7dac0 172 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530460] 
               prx_desc[00 ffffff800ea7dae0 174 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530484] 
               prx_desc[00 ffffff800ea7db00 176 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530507] 
               prx_desc[00 ffffff800ea7db10 177 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530532] 
               prx_desc[00 ffffff800ea7db20 178 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530562] 
               prx_desc[00 ffffff800ea7db30 179 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.531075] 
               prx_desc[00 ffffff800ea7db40 180 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.531100] 
               prx_desc[00 ffffff800ea7db50 181 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.531128] 
               prx_desc[00 ffffff800ea7db60 182 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.531211] 
               prx_desc[00 ffffff800ea7db70 183 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000

There are also errors and overruns in ifconfig output.

ubuntu@tegra-ubuntu:~$ ifconfig 
eth0      Link encap:Ethernet  HWaddr 00:04:4b:8d:46:b5  
          inet addr:10.0.32.3  Bcast:10.0.47.255  Mask:255.255.240.0
          inet6 addr: fe80::204:4bff:fe8d:46b5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:8575072 errors:3524 dropped:0 overruns:3524 frame:0
          TX packets:89727 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:76345646337 (76.3 GB)  TX bytes:5246902 (5.2 MB)
          Interrupt:42

Interestingly, far fewer errors occur when the interface is being monitored (i.e. is in promiscuous mode).

So far I have learned that the kernel message is printed by the eqos driver (source code from https://developer.nvidia.com/embedded/dlc/sources-r2821 under kernel-4.4/drivers/net/ethernet/nvidia/eqos), in drv.c (the dump_rx_desc call):

static int process_rx_completions(struct eqos_prv_data *pdata,
                                  int quota, UINT qinx)
{
[...]
                        if (!(prx_desc->rdes3 & err_bits) &&
                             (prx_desc->rdes3 & EQOS_RDESC3_LD)) {

[...]

                        } else {
                                dump_rx_desc(qinx, prx_desc,
                                             prx_ring->cur_rx);
                                if (!(prx_desc->rdes3 & EQOS_RDESC3_LD))
                                        pr_debug("Received oversized pkt,"
                                              "spanned across multiple desc\n");

                                /* recycle skb */
                                prx_swcx_desc->skb = skb;
                                dev->stats.rx_errors++;
                                eqos_update_rx_errors(dev,
                                                      prx_desc->rdes3);
                        }
[...]

The macros for prx_desc->rdes3 are defined in yheader.h (EQOS_RDESC3_*), so in this case we have rdes3 == 0x30208000, which stands for EQOS_RDESC3_FD | EQOS_RDESC3_LD | EQOS_RDESC3_OF | EQOS_RDESC3_ES.
Unfortunately these macros aren’t described in the header, so I couldn’t analyse the errors any further.
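
As a rough illustration of that decoding, here is a minimal sketch that splits an rdes3 value into flag names. The bit positions are an assumption taken from the mainline stmmac dwmac4 descriptor layout for the same Synopsys IP, not from yheader.h. The SK_* names and rdes3_decode are made up for this sketch, and only a subset of the bits is listed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMED bit positions (mainline stmmac dwmac4 layout), not yheader.h. */
#define SK_RDES3_OWN  (1u << 31) /* descriptor still owned by the DMA */
#define SK_RDES3_CTXT (1u << 30) /* context descriptor */
#define SK_RDES3_FD   (1u << 29) /* first descriptor of a packet */
#define SK_RDES3_LD   (1u << 28) /* last descriptor of a packet */
#define SK_RDES3_CE   (1u << 24) /* CRC error */
#define SK_RDES3_GP   (1u << 23) /* giant packet */
#define SK_RDES3_RWT  (1u << 22) /* receive watchdog timeout */
#define SK_RDES3_OF   (1u << 21) /* overflow error */
#define SK_RDES3_RE   (1u << 20) /* receive error */
#define SK_RDES3_DE   (1u << 19) /* dribble bit error */
#define SK_RDES3_ES   (1u << 15) /* error summary */
#define SK_RDES3_PL   0x7fffu    /* packet length, bits 14:0 */

struct sk_flag { uint32_t mask; const char *name; };

static const struct sk_flag sk_flags[] = {
    { SK_RDES3_OWN, "OWN" }, { SK_RDES3_CTXT, "CTXT" },
    { SK_RDES3_FD, "FD" },   { SK_RDES3_LD, "LD" },
    { SK_RDES3_CE, "CE" },   { SK_RDES3_GP, "GP" },
    { SK_RDES3_RWT, "RWT" }, { SK_RDES3_OF, "OF" },
    { SK_RDES3_RE, "RE" },   { SK_RDES3_DE, "DE" },
    { SK_RDES3_ES, "ES" },
};

/* Write the names of all set flags into buf, separated by '|'.
 * Returns the packet length field. Output is truncated if buf is small. */
uint32_t rdes3_decode(uint32_t rdes3, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(sk_flags) / sizeof(sk_flags[0]); i++) {
        if (!(rdes3 & sk_flags[i].mask))
            continue;
        if (buf[0] != '\0')
            strncat(buf, "|", len - strlen(buf) - 1);
        strncat(buf, sk_flags[i].name, len - strlen(buf) - 1);
    }
    return rdes3 & SK_RDES3_PL;
}
```

Under these assumed positions, rdes3_decode(0x30208000u, buf, sizeof buf) fills buf with "FD|LD|OF|ES" and returns a packet length of 0, i.e. a single-descriptor frame flagged with an overflow error.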

Is there anyone who could help with fixing this issue?

WayneWWW · July 19, 2018, 2:06am

jaroslav.beran,

Thanks for reporting this issue. Unfortunately, eqos contains some third-party IP, so we cannot release it in the TRM for you to debug further.

Please share the steps to reproduce this issue on an NVIDIA devkit so that we can investigate it.

jaroslav.beran · July 19, 2018, 1:39pm

Thank you for the reply.

The simplest way to reproduce this behaviour is to continuously transfer data using e.g. the netcat utility.

Connect the Jetson and a PC to the same network over ethernet.
On the Jetson, open a UDP port for listening:

ubuntu@tegra-ubuntu:~$ netcat -u -l 9999 > /dev/null

On the PC, send data to the Jetson’s interface:

user@pc:~$ cat /dev/zero | netcat -u 10.0.32.3 9999

On the Jetson, observe the kernel log (dmesg) and the errors/overruns in the network interface statistics (ifconfig).

Note: When the roles of the Jetson and the PC are reversed, i.e. the PC is receiving and the Jetson is sending the data, no errors occur. I observed these errors on RX, not TX.

jon.richardsm6bq8 · January 30, 2019, 3:53pm

Hi there, we are seeing the same issues. Details are here:
https://devtalk.nvidia.com/default/topic/1046870/jetson-tx2/onboard-ethernet-causing-100-kernel-failed-to-allocate-skb/

jiakai1000 · July 17, 2019, 11:57am

Hi there, we are seeing the same issues.

jon.richardsm6bq8 · July 17, 2019, 12:00pm

Out of interest, what is your MTU size?

jiakai1000 · July 17, 2019, 12:03pm

9000

jon.richardsm6bq8 · July 17, 2019, 12:14pm

We believe this is an issue with the driver. When the memory becomes fragmented the kernel is allowed to reject allocation requests. Large requests are more likely to be rejected than small requests.

If the request is rejected the driver is meant to ask for a smaller chunk or wait.

We speculate that the current driver ignores this error and hence the problems.

An MTU size of 9000 requires a contiguous block of 16384 bytes to be allocated by the kernel. These run out quite quickly when an application churns through lots of memory, allocating and deallocating.

An MTU size of, say, 8000 requires a contiguous block of 8192 bytes, and there tend to be more of these available. An MTU size of 4192 is even better, etc.
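
To make the arithmetic concrete, here is a minimal sketch of the rounding described above. It assumes the receive buffer is rounded up to the next power of two and that pages are 4096 bytes. It ignores any per-buffer overhead the real driver adds (headers, skb metadata), so the exact thresholds are approximate. pow2_block, block_order and SK_PAGE_SIZE are names invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SIZE 4096u /* ASSUMED page size */

/* Smallest power-of-two block that can hold `bytes`.
 * Simplified model: no driver or skb overhead is added. */
size_t pow2_block(size_t bytes)
{
    assert(bytes > 0);
    size_t block = 1;
    while (block < bytes)
        block <<= 1;
    return block;
}

/* Buddy allocator order of a block: the smallest order such that
 * SK_PAGE_SIZE << order is at least `block` bytes. */
unsigned int block_order(size_t block)
{
    unsigned int order = 0;
    while (((size_t)SK_PAGE_SIZE << order) < block)
        order++;
    return order;
}
```

In this model pow2_block(9000) gives 16384 (order 2, four contiguous pages), pow2_block(8000) gives 8192 (order 1) and pow2_block(1500) gives 2048, which fits inside a single page (order 0).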

It’s not to do with being out of memory, but being out of contiguous chunks of memory due to fragmentation.

cat /proc/buddyinfo tells you how many of each chunk size is available.
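
For anyone who wants to watch this from a program, here is a minimal sketch that parses one line of /proc/buddyinfo. Column k of each line is the number of free blocks of (page size << k) bytes in that zone. The function names and SK_MAX_ORDERS are invented for this sketch, and the number of columns depends on the kernel configuration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define SK_MAX_ORDERS 16 /* upper bound on columns for this sketch */

/* Parse one line such as "Node 0, zone   Normal   12   8   5 ...".
 * Stores the zone name and the per-order free block counts.
 * Returns the number of order columns parsed, or -1 on a malformed line. */
int parse_buddyinfo_line(const char *line, char zone[32],
                         unsigned long counts[SK_MAX_ORDERS])
{
    int node = 0, consumed = 0;

    assert(line != NULL && zone != NULL && counts != NULL);
    if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &consumed) != 2)
        return -1;

    const char *p = line + consumed;
    int n = 0;
    while (n < SK_MAX_ORDERS) {
        char *end;
        unsigned long v = strtoul(p, &end, 10);
        if (end == p) /* no more numbers on this line */
            break;
        counts[n++] = v;
        p = end;
    }
    return n;
}

/* Free blocks of at least the given order (larger blocks can be split). */
unsigned long free_blocks_at_least(const unsigned long *counts, int n,
                                   int order)
{
    unsigned long total = 0;
    for (int k = order; k < n; k++)
        total += counts[k];
    return total;
}
```

Feeding each line read with fgets from /proc/buddyinfo through parse_buddyinfo_line and then calling free_blocks_at_least(counts, n, 2) shows how many 16384-byte-or-larger blocks remain per zone, assuming 4096-byte pages.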

The correct resolution would be for someone to fix the driver (the code is proprietary we believe).

The only other workaround is a smaller MTU size or using a different ethernet adapter.

WayneWWW · July 18, 2019, 2:52am

Sorry that I missed this issue.

Jon,

Could you share the steps here and the release revision you are using?
The issue reproduces on the devkit, right?

jiakai1000 · July 22, 2019, 6:57am

I have to restore the MTU to 1500, thanks.
