Packet Drop on Orin mgbe0/1

We have been using the AGX Xavier with 2x 10GbE and it has been running well.
Recently we have been testing a move to Orin, and we are noticing an issue with packet drops.

We are using mgbe0 and mgbe1, both with the MTU configured to 9000 (the kernel reduces it to 8966) and both running in 10G mode, connected to the same switch we have been using with the AGX Xavier (2x X550-T), where everything worked perfectly.
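
For reference, a minimal way to check what the kernel actually applied (we request 9000, but the interfaces end up at 8966):

  ip link show mgbe0 | grep -o 'mtu [0-9]*'   # prints "mtu 8966" on our system
  ip link show mgbe1 | grep -o 'mtu [0-9]*'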

Whenever the TOTAL traffic exceeds ~1 GB/s or ~130K packets/s, which is less than half of what 2x 10GbE should handle, we start to see huge packet loss.

In ethtool -S, we see the following error counters keep increasing once the traffic exceeds ~1 GB/s (a quick way to watch them is sketched after the list):

  1. mmc_rx_fifo_overflow
  2. rx_buf_unavail_irq_n (2, 4 for mgbe1, and 0, 5 for mgbe0)
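
This is roughly how we watch those counters while generating load (counter names copied from our ethtool -S output; they may differ between L4T releases):

  watch -n 1 'for i in mgbe0 mgbe1; do ethtool -S $i | grep -E "fifo_overflow|buf_unavail"; done'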

Any idea how we can tune the device to improve this?

The carrier board we have been using is https://connecttech.com/product/forge-carrier-for-nvidia-jetson-agx-orin/.

Is this error reproducible on the mgbe0 of the AGX Orin devkit?

Is that also 10GbE? I will test it ASAP.

Yes, the Ethernet port on the Orin devkit is also 10GbE.

OK, I did a bit more testing and here are the results.

Whenever the combined bandwidth exceeds 1 GB/s, I start seeing massive packet drops and mmc_rx_fifo_overflow etc., whether the traffic goes through a single 10GbE port or both. I tried multiple combinations (everything through one port, or the load spread equally or unequally across the two); all of them result in massive packet drops above 1 GB/s (I see a peak of 7.96 Gb/s in iftop).
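
For anyone who wants to reproduce the combinations, the load was roughly of this shape (iperf3 is shown only as an illustration; our real traffic is our own application, and the addresses/ports are placeholders):

  # on the Orin: one UDP server per port
  iperf3 -s -p 5201 &
  iperf3 -s -p 5202 &
  # on the sender(s): push ~5 Gb/s of jumbo UDP at each port
  iperf3 -u -b 5G -l 8900 -t 60 -c 192.168.10.1 -p 5201
  iperf3 -u -b 5G -l 8900 -t 60 -c 192.168.11.1 -p 5202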

Of course I will need to test this on the devkit, but since it only has one 10GbE port, it will be a bit tricky to push it beyond 1 GB/s (the theoretical limit is 10 Gb/s, i.e. 1.25 GB/s).

I wonder if there is some problem with how the kernel is set up that causes the two ports to share some limit.

Another thing I noticed: when I push the traffic over 1 GB/s, tegrastats shows 100% CPU usage on CPU0:
12-27-2022 22:31:36 RAM 6827/30538MB (lfb 1783x4MB) SWAP 0/15269MB (cached 0MB) CPU [100%@2201,17%@2201,18%@2201,21%@2201,28%@2201,29%@2201,26%@2201,18%@2201] EMC_FREQ 0% GR3D_FREQ 0% Tdiode@28C Tboard@27C

I remember reading somewhere that all network interrupts can only be handled by CPU0 on the Jetson platform. Is this the cause, and is there a way to improve it? The AGX Xavier works fine for us, but we have been seeing these problems on the AGX Orin 32GB.
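
One way to confirm that is to look at the per-CPU interrupt counters for the MGBE queues (the exact names in /proc/interrupts may differ per release, so the grep pattern is just a guess):

  grep -i mgbe /proc/interrupts       # per-CPU counts for the MGBE queue IRQs
  cat /proc/irq/<irq>/smp_affinity    # current CPU mask for one of those IRQs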

This is our system version, in case it is useful:

R35 (release), REVISION: 1.0, GCID: 31250864, BOARD: t186ref, EABI: aarch64, DATE: Thu Aug 11 03:40:29 UTC 2022

Just want to confirm… is the result you just shared from your own carrier board or from the devkit?

This is still on my board. I will test again on the devkit, but I won't have access to it for a few days.

There seems to be a similar problem mentioned here. Is this being tracked?

Could you try setting the CPU affinity and see if the interrupts can be routed to other cores?

Examples –
CPU-1: echo 2 > /proc/irq/57/smp_affinity
CPU-2: echo 4 > /proc/irq/58/smp_affinity
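
The IRQ numbers above are only examples and differ per board and release. A rough sketch (run as root) that finds the MGBE queue IRQs and spreads them over CPU1-7, assuming the entries in /proc/interrupts contain "mgbe":

  core=1
  for irq in $(grep -i mgbe /proc/interrupts | cut -d: -f1); do
      echo $core > /proc/irq/$irq/smp_affinity_list   # pin this queue IRQ to a single core
      core=$(( core % 7 + 1 ))                        # rotate over CPU1..CPU7, leave CPU0 free
  done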

That actually worked!
I had to set the IRQ affinity of each one to a single CPU; if I assign the mask ff, they all just go back to CPU0.


It still uses a lot of CPU though. Is it possible to switch mgbe0-4 into polling mode (disable interrupts / NAPI)?
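
(Side note: I assume reducing the interrupt rate via coalescing would be the more conventional knob, if the MGBE driver exposes it at all; something like this, with arbitrary values:)

  ethtool -c mgbe0                              # show current coalescing settings, if supported
  ethtool -C mgbe0 rx-usecs 256 rx-frames 64    # batch more packets per RX interrupt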

An even bigger problem:
after I set smp_affinity to other cores, I start seeing a huge kernel memory leak of ~50 MB/s.

The top command doesn't show any process whose memory usage keeps growing, but the free command shows "used" memory increasing by ~50 MB/s.
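
A rough way to see where that memory goes is to watch the kernel slab caches while the traffic is running, e.g.:

  grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo   # total kernel slab usage over time
  slabtop -o -s c | head -n 15                           # largest slab caches, sorted by size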

Hi,

Just want to clarify how things work here: can your issue be reproduced on the AGX Orin devkit?

We can only investigate issues on the devkit.
