Ethernet packet loss

Jetson AGX Orin

JetPack 5.1.2

Problem description:

The interrupt distribution for the network port is uneven, which may cause packet loss.
cat /proc/interrupts:

As the image shows, the number of interrupts on CPU2 is much higher than on the other cores. It is not always CPU2: at other times a different core has the much higher count, and which one appears to be random.
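One way to put numbers on that imbalance is to total the per-CPU columns for the NIC's rows in /proc/interrupts. The sketch below is only an illustration: the sample text and the `eth0-rx-N` interrupt names are made up, not taken from the board in this thread, so substitute the real file contents and a pattern that matches your device.

```python
import re

# Illustrative sample only; the IRQ numbers and "eth0-rx-N" names are hypothetical.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 330:         10          0     900000          0   PCI-MSI 0 Edge      eth0-rx-0
 331:          0         25          0          0   PCI-MSI 1 Edge      eth0-rx-1
  14:       5000       5000       5000       5000   GICv3  26 Level     arch_timer
"""


def irq_counts_by_cpu(interrupts_text, name_pattern):
    """Sum per-CPU counts over every row whose description matches name_pattern."""
    lines = interrupts_text.strip("\n").splitlines()
    cpus = lines[0].split()                      # header row: CPU0 CPU1 ...
    totals = {cpu: 0 for cpu in cpus}
    for line in lines[1:]:
        _label, _, rest = line.partition(":")
        fields = rest.split()
        counts = fields[:len(cpus)]
        description = " ".join(fields[len(cpus):])
        # Summary rows such as "Err:" have fewer columns; skip anything not per-CPU.
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        if not re.search(name_pattern, description):
            continue
        for cpu, count in zip(cpus, counts):
            totals[cpu] += int(count)
    return totals


def share_by_cpu(totals):
    """Convert absolute counts into each CPU's fraction of the total."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {cpu: 0.0 for cpu in totals}
    return {cpu: count / grand_total for cpu, count in totals.items()}


totals = irq_counts_by_cpu(SAMPLE, r"eth0")
for cpu, share in share_by_cpu(totals).items():
    print(f"{cpu}: {totals[cpu]:>8} interrupts ({share:.1%})")
```

Taking two snapshots a few seconds apart and subtracting the totals shows where interrupts are landing now, rather than since boot.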
The RPS settings are as follows:

I tried modifying the rps_cpus and rps_flow_cnt values, but the changes had no effect.
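To confirm that the written values actually stuck, the per-queue files can be read back. This is a minimal sketch, assuming the interface is called `eth0` (a placeholder, the real name is not given in this thread) and using the sysfs layout described in the kernel's scaling documentation, `/sys/class/net/<dev>/queues/rx-<n>/`.

```python
from pathlib import Path


def read_rps_settings(device, sysfs_root="/sys/class/net"):
    """Return {queue_name: {"rps_cpus": ..., "rps_flow_cnt": ...}} for one interface.

    `device` is whatever name the interface has on your system; "eth0" in the
    usage note is only a placeholder.
    """
    queues_dir = Path(sysfs_root) / device / "queues"
    settings = {}
    for queue in sorted(queues_dir.glob("rx-*")):
        settings[queue.name] = {
            name: (queue / name).read_text().strip()
            for name in ("rps_cpus", "rps_flow_cnt")
            if (queue / name).exists()
        }
    return settings
```

Calling `read_rps_settings("eth0")` returns one entry per receive queue. An all-zero `rps_cpus` mask means RPS is disabled for that queue. Per the same kernel document, `rps_flow_cnt` only takes effect for RFS once the global `/proc/sys/net/core/rps_sock_flow_entries` is also set.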

thanks

Hi 837535053,

Are you using the devkit or custom board for AGX Orin?

Is MSI ******* your custom network device? I cannot find it on the AGX Orin devkit.

Please refer to IRQ Balancing - #6 by sumitg to configure the other cores to handle the interrupt.

$ sudo su
# cd /proc/irq/331
# cat smp_affinity
# cat smp_affinity_list
# echo ff > smp_affinity
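For reference, `smp_affinity` is a hexadecimal bitmask in which bit N selects CPU N, so `ff` allows CPUs 0 through 7. The helpers below are a small sketch for converting between masks and CPU lists; the function names are my own and not part of any kernel tooling.

```python
def mask_to_cpus(mask_hex):
    """Return the CPU indices selected by a hex affinity mask such as "ff"."""
    # The kernel prints wide masks as comma-separated 32-bit groups; drop the commas.
    mask = int(mask_hex.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def cpus_to_mask(cpus):
    """Return the hex mask (no 0x prefix) that selects the given CPU indices."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


print(mask_to_cpus("ff"))        # CPUs allowed by the mask written above
print(cpus_to_mask([2]))         # mask that pins an IRQ to CPU2 only
```

`smp_affinity_list` takes the same information as a list or range (for example `0-7`), which avoids the hex arithmetic altogether.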

Please also refer to "R36.3 Patch to re-enable GICv2m for PCIe MSI interrupts and restore I/O performance" and check whether those patches help in your case.

Hi,

We are using our own custom carrier board with an NVIDIA Orin module.

I tried two methods:

  1. Automatically balancing interrupts with irqbalance

  2. Manually configuring the interrupt distribution using the method you described. The effective value does not change.

Neither method works. For each interrupt line, only one CPU’s interrupt count accumulates over a long period of time.

thanks

For modern NICs that support RSS the driver usually allocates one receive queue per CPU core and the interrupt affinity is set to bind each queue to a specific CPU (as in your first image).

Received packets are directed by the NIC hardware to a queue based on a hash of the packet headers. So the distribution of load across the CPUs depends on the number of flows being received and the relative number of packets in each flow.
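A toy model makes the consequence easy to see. The sketch below uses CRC32 over the 4-tuple as a stand-in hash; that is an assumption for illustration only, since real NICs typically use a Toeplitz hash with a device key and an indirection table. The addresses and ports are invented.

```python
import zlib
from collections import Counter


def queue_for_flow(src_ip, src_port, dst_ip, dst_port, num_queues):
    """Pick a receive queue from the flow's 4-tuple (stand-in hash, not Toeplitz)."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_queues


def packets_per_queue(flows, num_queues):
    """flows: iterable of ((src_ip, src_port, dst_ip, dst_port), packet_count)."""
    load = Counter()
    for four_tuple, packets in flows:
        load[queue_for_flow(*four_tuple, num_queues)] += packets
    return load


# One heavy connection: every packet hashes to the same queue, hence the same CPU.
single = packets_per_queue([(("10.0.0.1", 5000, "10.0.0.2", 6000), 1_000_000)], 4)
print(single)

# Many connections: the same total traffic can spread over several queues.
many = packets_per_queue(
    [(("10.0.0.1", 5000 + i, "10.0.0.2", 6000), 15_625) for i in range(64)], 4
)
print(many)
```

Because the hash depends only on header fields, a single flow always maps to the same queue no matter how the IRQ affinity is set, which matches one CPU's counter climbing while the others stay flat.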

RPS won’t help unless you have more CPUs than queues, or your NIC doesn’t support RSS.

Looks like you may be trying to receive a lot of data from a single connection. This will all go to a single queue and therefore one CPU core. If that CPU core is 100% busy, there is not a lot you can do about it other than increase the packet size to reduce CPU overhead and ensure that any hardware offloading features supported by the NIC are enabled.
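As a rough worked example of why packet size matters: the number of packets the busy core must handle per second falls in proportion to packet size. The figures below assume a 1 Gbit/s stream and compare 1500-byte with 9000-byte packets. Both numbers are illustrative assumptions (the thread does not state the link rate), and framing overhead is ignored.

```python
def packets_per_second(throughput_bits_per_s, packet_bytes):
    """Approximate packet rate for a given throughput, ignoring framing overhead."""
    return throughput_bits_per_s / (packet_bytes * 8)


LINK_RATE = 1_000_000_000  # assumed 1 Gbit/s, for illustration only

for size in (1500, 9000):
    print(f"{size}-byte packets: {packets_per_second(LINK_RATE, size):,.0f} packets/s")
```

With these assumptions, the larger packets cut the per-packet work on that core by a factor of six for the same data rate.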

See https://www.kernel.org/doc/Documentation/networking/scaling.txt

My patch for R36.3 referred to above won’t help, as this code is already present in JetPack 5.1.2, but you’d definitely need it if you ever update to JetPack 6. Without the patch, the interrupts for all the queues would be handled on CPU #0.
