Packet Drop on Orin mgbe0/1

We have been using the AGX Xavier with 2x 10GbE and it has been running well.
Recently we have been testing a move to Orin, and we are noticing an issue with packet drops.

We are using mgbe0 and mgbe1, both with the MTU configured to 9000 (the kernel reduces it to 8966) and both running in 10G mode, connected to the same switch we have been using with the AGX Xavier (2x X550-T), where everything worked perfectly.
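
For reference, a minimal way to check what the kernel actually applied (we request 9000, but the interfaces end up at 8966):

  ip link show mgbe0 | grep -o 'mtu [0-9]*'   # prints "mtu 8966" on our system
  ip link show mgbe1 | grep -o 'mtu [0-9]*'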

Whenever the TOTAL traffic exceeds ~1 GB/s or ~130K packets/s, which is less than half of what 2x 10GbE should handle, we start to see huge packet loss.

In ethtool -S, we see the following error counters keep increasing once the traffic exceeds ~1 GB/s (a quick way to watch them is sketched after the list):

  1. mmc_rx_fifo_overflow
  2. rx_buf_unavail_irq_n (2, 4 for mgbe1, and 0, 5 for mgbe0)
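
This is roughly how we watch those counters while generating load (counter names copied from our ethtool -S output; they may differ between L4T releases):

  watch -n 1 'for i in mgbe0 mgbe1; do ethtool -S $i | grep -E "fifo_overflow|buf_unavail"; done'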

Any idea how we can tune the device to improve this?

The carrier board we have been using is https://connecttech.com/product/forge-carrier-for-nvidia-jetson-agx-orin/.

Is this error reproducible on the mgbe0 of the AGX Orin devkit?

Is that also 10GbE? I will test it ASAP.

Yes, the Ethernet port on the Orin devkit is also 10GbE.

OK, I did a bit more testing and here are the results.

Whenever the combined bandwidth exceeds 1 GB/s, I start seeing massive packet drops and mmc_rx_fifo_overflow etc., whether the traffic goes through a single 10GbE port or both. I tried multiple combinations (everything through one port, or the load spread equally or unequally across the two); all of them result in massive packet drops above 1 GB/s (I see a peak of 7.96 Gb/s in iftop).
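
For anyone who wants to reproduce the combinations, the load was roughly of this shape (iperf3 is shown only as an illustration; our real traffic is our own application, and the addresses/ports are placeholders):

  # on the Orin: one UDP server per port
  iperf3 -s -p 5201 &
  iperf3 -s -p 5202 &
  # on the sender(s): push ~5 Gb/s of jumbo UDP at each port
  iperf3 -u -b 5G -l 8900 -t 60 -c 192.168.10.1 -p 5201
  iperf3 -u -b 5G -l 8900 -t 60 -c 192.168.11.1 -p 5202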

Of course I will need to test this on the devkit, but since it only has one 10GbE port, it will be a bit tricky to push it beyond 1 GB/s (the theoretical limit is 10 Gb/s, i.e. 1.25 GB/s).

I wonder if there is some problem with how the kernel is set up that causes the two ports to share some limit.

Another thing I noticed: when I push the traffic over 1 GB/s, tegrastats shows 100% CPU usage on CPU0:
12-27-2022 22:31:36 RAM 6827/30538MB (lfb 1783x4MB) SWAP 0/15269MB (cached 0MB) CPU [100%@2201,17%@2201,18%@2201,21%@2201,28%@2201,29%@2201,26%@2201,18%@2201] EMC_FREQ 0% GR3D_FREQ 0% Tdiode@28C Tboard@27C

I remember reading somewhere that all network interrupts can only be handled by CPU0 on the Jetson platform. Is this the cause, and is there a way to improve it? The AGX Xavier works fine for us, but we have been seeing these problems on the AGX Orin 32GB.
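
One way to confirm that is to look at the per-CPU interrupt counters for the MGBE queues (the exact names in /proc/interrupts may differ per release, so the grep pattern is just a guess):

  grep -i mgbe /proc/interrupts       # per-CPU counts for the MGBE queue IRQs
  cat /proc/irq/<irq>/smp_affinity    # current CPU mask for one of those IRQs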

This is our system version, in case it is useful:

R35 (release), REVISION: 1.0, GCID: 31250864, BOARD: t186ref, EABI: aarch64, DATE: Thu Aug 11 03:40:29 UTC 2022

Just want to confirm… is the result you just shared from your own carrier board or from the devkit?

This is still on my board. I will test again on the devkit, but I won't have access to it for a few days.

There seems to be a similar problem mentioned here. Is this being tracked?

Could you try setting the CPU affinity and see if the interrupts can be routed to other cores?

Examples –
CPU-1: echo 2 > /proc/irq/57/smp_affinity
CPU-2: echo 4 > /proc/irq/58/smp_affinity
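
The IRQ numbers above are only examples and differ per board and release. A rough sketch (run as root) that finds the MGBE queue IRQs and spreads them over CPU1-7, assuming the entries in /proc/interrupts contain "mgbe":

  core=1
  for irq in $(grep -i mgbe /proc/interrupts | cut -d: -f1); do
      echo $core > /proc/irq/$irq/smp_affinity_list   # pin this queue IRQ to a single core
      core=$(( core % 7 + 1 ))                        # rotate over CPU1..CPU7, leave CPU0 free
  done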

That actually worked!
I had to set the IRQ affinity of each one to a single CPU; if I assign the mask ff, they all just go back to CPU0.


It still uses a lot of CPU though. Is it possible to switch mgbe0-4 into polling mode (disable interrupts / NAPI)?
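
(Side note: I assume reducing the interrupt rate via coalescing would be the more conventional knob, if the MGBE driver exposes it at all; something like this, with arbitrary values:)

  ethtool -c mgbe0                              # show current coalescing settings, if supported
  ethtool -C mgbe0 rx-usecs 256 rx-frames 64    # batch more packets per RX interrupt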

An even bigger problem:
after I set smp_affinity to other cores, I start seeing a huge kernel memory leak of ~50 MB/s.

The top command doesn't show any process whose memory usage keeps growing, but the free command shows "used" memory increasing by ~50 MB/s.
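
A rough way to see where that memory goes is to watch the kernel slab caches while the traffic is running, e.g.:

  grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo   # total kernel slab usage over time
  slabtop -o -s c | head -n 15                           # largest slab caches, sorted by size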

Hi,

Just want to clarify how things work here: can your issue be reproduced on the AGX Orin devkit?

We can only investigate issues on the devkit.
