We have been using AGX Xavier with 2x10GbE and it has been running well.
Recently we have been testing a move to Orin, and we have been noticing packet drops.
We are using mgbe0 and mgbe1, both with MTU configured as 9000 (the kernel reduces it to 8966) and both running in 10G mode. They are connected to a switch, the same one we have been using with the AGX Xavier (with 2x X550T), where it worked perfectly.
Whenever the TOTAL traffic exceeds ~1GB/s or ~130K packets, which is less than half the capacity of 2x10GbE, we start to see huge packet loss.
In the ethtool -S output, we see the following error counters keep increasing when the traffic exceeds ~1GB/s:
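To make the units explicit, here is a minimal sketch of the arithmetic behind "less than half". It assumes 1GB means 10^9 bytes and that the ~130K packet figure is a per-second rate; both are assumptions, and the function names are only illustrative.

```python
def gbytes_to_gbits(gbytes_per_s: float) -> float:
    """Convert a rate in GB/s to Gb/s (8 bits per byte)."""
    return gbytes_per_s * 8.0


def link_utilisation(gbytes_per_s: float, link_gbits: float, links: int = 1) -> float:
    """Fraction of the aggregate line rate that a given GB/s load represents."""
    return gbytes_to_gbits(gbytes_per_s) / (link_gbits * links)


def mean_packet_bytes(bytes_per_s: float, packets_per_s: float) -> float:
    """Average packet size implied by a byte rate and a packet rate."""
    return bytes_per_s / packets_per_s
```

For example, link_utilisation(1.0, 10.0, links=2) gives the share of 2x10GbE that 1GB/s occupies, and mean_packet_bytes(1e9, 130e3) gives the average packet size implied by the two figures quoted above.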
mmc_rx_fifo_overflow
rx_buf_unavail_irq_n (2, 4 for mgbe1, and 0, 5 for mgbe0)
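A rough sketch of how these counters can be watched over time, in case it helps anyone reproduce this. It only relies on `ethtool -S <iface>` printing `name: value` lines; the exact counter spelling (for example whether the per-queue counters carry an index suffix) is an assumption, so counters are matched by prefix. The helper names are hypothetical.

```python
import subprocess
import time


def parse_ethtool_stats(text):
    """Turn `ethtool -S` output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # no "name: value" pair on this line
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            continue  # e.g. the "NIC statistics:" header line
    return stats


def counter_deltas(before, after, prefixes):
    """Return the counters matching any prefix whose value changed."""
    wanted = tuple(prefixes)
    deltas = {}
    for name, value in after.items():
        change = value - before.get(name, 0)
        if name.startswith(wanted) and change != 0:
            deltas[name] = change
    return deltas


def read_stats(iface):
    """Run `ethtool -S` for one interface and parse the result."""
    result = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    )
    return parse_ethtool_stats(result.stdout)


def watch(iface, seconds=1.0,
          prefixes=("mmc_rx_fifo_overflow", "rx_buf_unavail_irq_n")):
    """Sample the counters twice and report what increased in between."""
    before = read_stats(iface)
    time.sleep(seconds)
    return counter_deltas(before, read_stats(iface), prefixes)
```

Calling watch("mgbe0") while traffic is running returns only the overflow and buffer-unavailable counters that moved during the interval.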
Any idea how we can tune the device to improve this?
Whenever the combined bandwidth exceeds 1GB/s, I start seeing massive packet drops, mmc_rx_fifo_overflow, etc., whether the traffic comes through a single 10GbE port or both. I tried multiple combinations (all through one port, or the load spread equally or unequally across the two), and all of them result in massive packet drops above 1GB/s (the peak I see in iftop is 7.96Gb).
Of course I will need to test this on the devkit, but since it only has one 10GbE port, it will be a bit tricky to push it beyond 1GB/s (when the theoretical limit is 10Gb/s, aka 1.25GB/s).
I wonder if there is some problem with how the kernel is set up that causes the 2 ports to share some limit.
Another thing I noticed: when I push the traffic over 1GB/s, tegrastats shows 100% CPU usage on CPU0:
12-27-2022 22:31:36 RAM 6827/30538MB (lfb 1783x4MB) SWAP 0/15269MB (cached 0MB) CPU [100%@2201,17%@2201,18%@2201,21%@2201,28%@2201,29%@2201,26%@2201,18%@2201] EMC_FREQ 0% GR3D_FREQ 0% Tdiode@28C Tboard@27C
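For logging this over a longer run, a small sketch that pulls the per-core load out of a tegrastats line like the one above. It assumes the `CPU [load%@freq,...]` layout shown in that output; any field without a numeric load is recorded as None. The function names are hypothetical.

```python
import re


def parse_cpu_loads(line):
    """Extract per-core load percentages from one tegrastats line."""
    match = re.search(r"CPU \[([^\]]*)\]", line)
    if match is None:
        return []
    loads = []
    for field in match.group(1).split(","):
        percent, _, _freq = field.strip().partition("%")
        # Fields without a numeric load are kept as None, not dropped,
        # so list positions still correspond to core numbers.
        loads.append(int(percent) if percent.isdigit() else None)
    return loads


def saturated_cores(loads, threshold=95):
    """Indices of cores whose load is at or above the threshold."""
    return [i for i, load in enumerate(loads)
            if load is not None and load >= threshold]
```

Feeding each tegrastats line through parse_cpu_loads and then saturated_cores shows whether it is always the same core that is pinned while the drops occur.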
I remember seeing somewhere that all network interrupts can only be handled by CPU0 on the Jetson platform. Is this the cause? Is there a way to improve this? We are using Xavier AGX and that works fine, but we have been seeing problems on AGX Orin 32GB.
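One way to check the CPU0 theory is to look at where the interface interrupts are actually counted in /proc/interrupts. Below is a minimal sketch that parses that file. The names under which the driver registers its IRQs are an assumption, so the keyword is whatever substring appears on the system in question; the helper names are hypothetical. Whether the affinity of those IRQs can be changed on this platform (for example through /proc/irq/<n>/smp_affinity) is not something this sketch establishes.

```python
def irq_counts_by_cpu(text, keyword):
    """Map IRQ number -> per-CPU counts for lines containing keyword."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        if keyword not in line:
            continue
        fields = line.split()
        irq = fields[0].rstrip(":")
        counts = fields[1:1 + ncpu]
        # Skip summary rows that do not have one count per CPU.
        if len(counts) < ncpu or not all(c.isdigit() for c in counts):
            continue
        result[irq] = [int(c) for c in counts]
    return result


def totals_per_cpu(counts_by_irq):
    """Sum the matched IRQ counts column by column (one total per CPU)."""
    return [sum(column) for column in zip(*counts_by_irq.values())]


def read_interrupts(path="/proc/interrupts"):
    """Read the kernel's interrupt table as text."""
    with open(path) as handle:
        return handle.read()
```

For example, totals_per_cpu(irq_counts_by_cpu(read_interrupts(), "mgbe")) gives one total per CPU. If everything lands in the first column, that would be consistent with the 100% load on CPU0.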