Multiple prx_desc errors on TX2 ethernet

Hello,

I am seeing performance problems and errors when receiving larger amounts of data (more than approximately 80 MB/s) on the integrated Ethernet of the TX2 (ether_eqos).

Multiple messages like these appear in the kernel log:

ubuntu@tegra-ubuntu:~$ dmesg

[...]

[ 5295.529451] 
               prx_desc[00 ffffff800ea7d9a0 154 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529474] 
               prx_desc[00 ffffff800ea7d9b0 155 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529502] 
               prx_desc[00 ffffff800ea7d9c0 156 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529531] 
               prx_desc[00 ffffff800ea7d9d0 157 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529552] 
               prx_desc[00 ffffff800ea7d9e0 158 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529574] 
               prx_desc[00 ffffff800ea7d9f0 159 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529603] 
               prx_desc[00 ffffff800ea7da00 160 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529624] 
               prx_desc[00 ffffff800ea7da10 161 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529649] 
               prx_desc[00 ffffff800ea7da20 162 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.529695] 
               prx_desc[00 ffffff800ea7da30 163 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530141] 
               prx_desc[00 ffffff800ea7da40 164 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530169] 
               prx_desc[00 ffffff800ea7da50 165 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530199] 
               prx_desc[00 ffffff800ea7da60 166 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530228] 
               prx_desc[00 ffffff800ea7da70 167 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530255] 
               prx_desc[00 ffffff800ea7da80 168 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530283] 
               prx_desc[00 ffffff800ea7da90 169 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530306] 
               prx_desc[00 ffffff800ea7daa0 170 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530329] 
               prx_desc[00 ffffff800ea7dab0 171 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530397] 
               prx_desc[00 ffffff800ea7dac0 172 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530460] 
               prx_desc[00 ffffff800ea7dae0 174 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530484] 
               prx_desc[00 ffffff800ea7db00 176 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530507] 
               prx_desc[00 ffffff800ea7db10 177 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530532] 
               prx_desc[00 ffffff800ea7db20 178 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.530562] 
               prx_desc[00 ffffff800ea7db30 179 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.531075] 
               prx_desc[00 ffffff800ea7db40 180 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.531100] 
               prx_desc[00 ffffff800ea7db50 181 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.531128] 
               prx_desc[00 ffffff800ea7db60 182 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000
[ 5295.531211] 
               prx_desc[00 ffffff800ea7db70 183 RECEIVED FROM DEVICE] = 0x0:0x0:0x0:0x30208000

There are also errors and overruns in the ifconfig output.

ubuntu@tegra-ubuntu:~$ ifconfig 
eth0      Link encap:Ethernet  HWaddr 00:04:4b:8d:46:b5  
          inet addr:10.0.32.3  Bcast:10.0.47.255  Mask:255.255.240.0
          inet6 addr: fe80::204:4bff:fe8d:46b5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:8575072 errors:3524 dropped:0 overruns:3524 frame:0
          TX packets:89727 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:76345646337 (76.3 GB)  TX bytes:5246902 (5.2 MB)
          Interrupt:42

Interestingly, far fewer errors occur when the interface is being monitored (i.e. is in promiscuous mode).

So far I have learned that the kernel message is printed by the eqos driver (source code from https://developer.nvidia.com/embedded/dlc/sources-r2821
under kernel-4.4/drivers/net/ethernet/nvidia/eqos), in drv.c (the dump_rx_desc call):

static int process_rx_completions(struct eqos_prv_data *pdata,
                                  int quota, UINT qinx)
{
[...]
                        if (!(prx_desc->rdes3 & err_bits) &&
                             (prx_desc->rdes3 & EQOS_RDESC3_LD)) {

[...]

                        } else {
                                dump_rx_desc(qinx, prx_desc,
                                             prx_ring->cur_rx);
                                if (!(prx_desc->rdes3 & EQOS_RDESC3_LD))
                                        pr_debug("Received oversized pkt,"
                                              "spanned across multiple desc\n");

                                /* recycle skb */
                                prx_swcx_desc->skb = skb;
                                dev->stats.rx_errors++;
                                eqos_update_rx_errors(dev,
                                                      prx_desc->rdes3);
                        }
[...]

The macros for prx_desc->rdes3 are defined in yheader.h (EQOS_RDESC3_*), so in this case we have rdes3 == 0x30208000, which stands for EQOS_RDESC3_FD | EQOS_RDESC3_LD | EQOS_RDESC3_OF | EQOS_RDESC3_ES.
Unfortunately, these macros aren’t described in the header, so I couldn’t analyse the errors any further.
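
As a quick check of the bit arithmetic, here is a minimal Python sketch that lists the bits set in 0x30208000. The bit names are an assumption: the positions follow the receive write-back descriptor layout used by the mainline Linux stmmac (dwmac4) driver for the same Synopsys core family, not the undocumented macros in yheader.h. The table and function names are hypothetical.

```python
# Assumed bit names (hypothetical table). Positions follow the mainline
# Linux stmmac "dwmac4" RX write-back descriptor, NOT NVIDIA's yheader.h.
ASSUMED_RDES3_BITS = {
    29: "FD (first descriptor)",
    28: "LD (last descriptor)",
    21: "OF (overflow error)",
    15: "ES (error summary)",
}
PACKET_LENGTH_MASK = 0x7FFF  # assumed: bits 14:0 hold the packet length


def set_bits(value):
    """Return the positions of all bits set in a 32-bit value, high to low."""
    return [bit for bit in range(31, -1, -1) if value & (1 << bit)]


def decode_rdes3(value):
    """Pair each set bit with its assumed name, or 'unknown' if not listed."""
    return [(bit, ASSUMED_RDES3_BITS.get(bit, "unknown")) for bit in set_bits(value)]


if __name__ == "__main__":
    rdes3 = 0x30208000
    for bit, name in decode_rdes3(rdes3):
        print(f"bit {bit:2d}: {name}")
    print("packet length field:", rdes3 & PACKET_LENGTH_MASK)
```

For 0x30208000 the set bits are 29, 28, 21 and 15, and the assumed length field is 0. Under the assumed layout, this would be a single-descriptor frame flagged with an overflow error.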

Is there anyone who could help with fixing this issue?

jaroslav.beran,

Thanks for reporting the issue. Unfortunately, eqos contains some third-party IP, so we cannot release its details in the TRM for you to debug further.

Please share the steps to reproduce this issue on an NVIDIA devkit so that we can investigate it.

Thank you for the reply.

The simplest way to reproduce this behaviour is to transfer data continuously, e.g. using the netcat utility.

  1. Connect the Jetson and a PC to the same network over Ethernet.

  2. On the Jetson, open a UDP port for listening:

ubuntu@tegra-ubuntu:~$ netcat -u -l 9999 > /dev/null

  3. On the PC, send data to the Jetson’s interface:

user@pc:~$ cat /dev/zero | netcat -u 10.0.32.3 9999

  4. On the Jetson, observe the kernel log (dmesg) and the errors/overruns in the network interface statistics (ifconfig).

Note: When the roles of the Jetson and the PC are reversed, i.e. the PC is receiving and the Jetson is sending the data, no errors occur. I observed these errors on RX, not TX.
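
To watch the counters while the transfer is running, instead of re-running ifconfig by hand, something like the following Python sketch could be used. It reads the standard Linux per-interface counters under /sys/class/net/<iface>/statistics. The function names and the choice of counters are my own, and I am not claiming which of these counters ifconfig reports as "overruns".

```python
import os
import time

# Standard Linux sysfs RX counters; which ones matter here is an assumption.
COUNTERS = ("rx_packets", "rx_errors", "rx_dropped", "rx_fifo_errors", "rx_over_errors")


def read_rx_counters(iface="eth0", sysfs_root="/sys/class/net"):
    """Read the selected RX counters for one interface from sysfs."""
    stats_dir = os.path.join(sysfs_root, iface, "statistics")
    counters = {}
    for name in COUNTERS:
        with open(os.path.join(stats_dir, name)) as f:
            counters[name] = int(f.read().strip())
    return counters


def counter_delta(before, after):
    """Per-counter increase between two snapshots."""
    return {name: after[name] - before[name] for name in before}


def watch(iface="eth0", interval=1.0, iterations=10):
    """Print the per-interval increase of each counter a fixed number of times."""
    previous = read_rx_counters(iface)
    for _ in range(iterations):
        time.sleep(interval)
        current = read_rx_counters(iface)
        print(counter_delta(previous, current))
        previous = current
```

Calling watch("eth0") on the Jetson while the PC is sending should show whether rx_errors grows together with the prx_desc messages in dmesg.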

Hi there, we are seeing the same issues. Details are here:
https://devtalk.nvidia.com/default/topic/1046870/jetson-tx2/onboard-ethernet-causing-100-kernel-failed-to-allocate-skb/

Out of interest, what is your MTU size?

9000

We believe this is an issue with the driver. When the memory becomes fragmented the kernel is allowed to reject allocation requests. Large requests are more likely to be rejected than small requests.

If the request is rejected the driver is meant to ask for a smaller chunk or wait.

We speculate that the current driver ignores this error, which would explain the problems.

An MTU size of 9000 requires a contiguous block of 16384 bytes to be allocated by the kernel. These run out quite quickly when an application churns through lots of memory, allocating and deallocating.

An MTU size of, say, 8000 requires a contiguous block of 8192 bytes, and there tend to be more of these available. An MTU size of 4192 is even better, and so on.
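
The block sizes quoted above can be reproduced with a small Python sketch, under two assumptions that are not confirmed for this driver: that each receive buffer is rounded up to the next power of two, and that the only per-frame overhead is the Ethernet header (14 bytes) plus FCS (4 bytes). The real driver may add more overhead. The 4096-byte page size is also assumed, and the function names are hypothetical.

```python
ETH_HEADER_BYTES = 14  # destination MAC + source MAC + EtherType
ETH_FCS_BYTES = 4      # frame check sequence
PAGE_SIZE = 4096       # assumed page size; check with `getconf PAGESIZE`


def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def rx_block_size(mtu, overhead=ETH_HEADER_BYTES + ETH_FCS_BYTES):
    """Assumed contiguous block size in bytes for one receive buffer."""
    return next_power_of_two(mtu + overhead)


def buddy_order(block_bytes, page_size=PAGE_SIZE):
    """Smallest buddy-allocator order whose block holds block_bytes."""
    pages = -(-block_bytes // page_size)  # ceiling division
    return (pages - 1).bit_length()


if __name__ == "__main__":
    for mtu in (1500, 8000, 9000):
        size = rx_block_size(mtu)
        print(f"MTU {mtu}: {size} bytes, order {buddy_order(size)}")
```

With these assumptions, MTU 9000 gives 16384 bytes (order 2), MTU 8000 gives 8192 bytes (order 1), and MTU 1500 gives 2048 bytes, which fits in a single page (order 0).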

The problem is not running out of memory, but running out of contiguous chunks of memory due to fragmentation.

cat /proc/buddyinfo tells you how many free chunks of each size are available.
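
For reference, each row of /proc/buddyinfo is one memory zone, and column i is the number of free blocks of 2^i pages. A minimal Python sketch for reading it follows. The function names are hypothetical, and the order in the usage note assumes 4096-byte pages, where a 16384-byte block is order 2.

```python
def parse_buddyinfo(text):
    """Parse /proc/buddyinfo text into {(node, zone): [free blocks per order]}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: "Node 0, zone Normal <count order 0> <count order 1> ..."
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = int(fields[1].rstrip(","))
        zone = fields[3]
        result[(node, zone)] = [int(count) for count in fields[4:]]
    return result


def free_blocks_at_least(counts, order):
    """Free blocks of the given order or larger (larger blocks can be split)."""
    return sum(counts[order:])


def read_buddyinfo(path="/proc/buddyinfo"):
    """Read and parse the live buddyinfo file."""
    with open(path) as f:
        return parse_buddyinfo(f.read())
```

For example, free_blocks_at_least(read_buddyinfo()[(0, "Normal")], 2) would report how many free blocks of at least 16384 bytes remain in the Normal zone, if that zone exists on the board.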

The correct resolution would be for someone to fix the driver (the code is proprietary, we believe).

The only other workarounds are to use a smaller MTU size or a different Ethernet adapter.

Sorry that I missed this issue.

Jon,

Could you share the steps here and the release revision you are using?
The issue is reproduced on devkit, right?

I have to restore MTU to 1500, thanks.