Network Packet Loss for Xavier NX vs Orin NX

During the network comparison test between the Orin and the Xavier, I see that all the network traffic is routed through CPU core 1 on both boards. One difference I see is that the CPU percentage being used differs between the two (higher on the Orin).

1. Xavier without the stress test: sending ~930 Mbit/s uses about 48% of CPU core 1.

2. Xavier with the stress test: sending ~930 Mbit/s still uses about 48% of CPU core 1 for the data transfer, and the rest of the core is used by other operations (the stress test in this case).

3. Orin without the stress test: sending ~930 Mbit/s uses about 92% (sometimes more) of CPU core 1.

4. Orin with the stress test: sending ~930 Mbit/s uses about 92% (sometimes more) of CPU core 1, and the rest of the core is used by other operations (the stress test in this case). When I run the stress test, about 4.6% is allocated to it.

The problem statement:
The Orin NX drops packets over Ethernet for about 5-9 seconds. The occurrence is random; it can happen every few minutes or as rarely as once in 60 minutes. During the drop, the terminal connected to the Orin freezes. If I have a terminal open with jtop/htop, all of the numbers freeze for the duration of the Ethernet drop.
In the attached images I use an iperf command to send data from an Orin and from a Xavier. As the images suggest, the Orin uses 92% of CPU1 versus 45% for the Xavier.
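For reference, the exact iperf invocation is not shown here; a rough sketch of the kind of test described above (the server address, duration, and whether plain iperf or iperf3 was used are assumptions on my side) would be:

# On the receiving machine (address below is hypothetical):
iperf -s
# On the Orin or Xavier under test, send TCP traffic for 10 minutes, reporting once per second:
iperf -c 192.168.1.10 -t 600 -i 1
# In a second terminal, watch per-core load while the test runs:
htop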

I need your help to understand this: if, for instance, the Orin is using CPU1 for some other processes at that instant and cannot find the relevant CPU threads, will it start dropping the network connections?

This might be related, and will help explain at least part of it…

Whenever a kernel driver needs to run, an interrupt occurs. The scheduler sees this and, based on scheduling rules, picks when to run the driver. To some extent, the scheduler is also responsible for choosing which CPU core to use (but that’s not the whole story).

It is important to understand that there are two types of interrupts (IRQs). One is a software IRQ, and these do not talk to hardware on physical addresses; they are purely software, e.g., calculating a checksum in software could be a soft IRQ. The part of the kernel which schedules and distributes software IRQs is ksoftirqd.
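If you want to see where that soft IRQ work is landing, the kernel exposes per-core counters, and the ksoftirqd threads themselves are visible as ordinary kernel threads (this is just a generic Linux sketch, nothing Jetson-specific):

# Per-CPU counts of each soft IRQ type (NET_RX/NET_TX are the network ones):
cat /proc/softirqs
# One ksoftirqd thread exists per core; the PSR column shows which core each runs on:
ps -e -o pid,psr,comm | grep ksoftirqd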

Note that a soft IRQ might pick any CPU core, but because it understands cache hits (versus cache misses), the scheduler will tend to apply pressure to keep a given PID (or thread ID) on the same core (switching cores generally costs performance through cache misses).
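As a quick way to see this in practice (the PID below is just a placeholder), you can ask ps for the core each thread of a process last ran on and watch whether it stays put:

# PSR is the core the thread last executed on; -L lists one row per thread:
ps -L -o pid,tid,psr,pcpu,comm -p 12345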

When we are dealing with hardware drivers (a hardware IRQ), actual physical addresses must be used. In addition, an actual wire has to exist between the CPU core and the hardware. On an x86 (AMD/Intel) style desktop PC there is an I/O APIC (Advanced Programmable Interrupt Controller). That APIC can have its programming changed to route a hardware IRQ to any core, and has the correct wiring for that. There will be more cases of needing atomic code sections for a hardware driver than for a soft IRQ. One example: if you want to pause an I/O transfer from one part of memory to another, you can easily find ways to pause this and restart it later. An example of when this cannot be done might be the same sort of I/O request between memory and the hard drive; the hard drive might not have the ability to pause in the middle of a memory request the way a pure software program might.

It is considered good practice to take any hardware driver and divide operations which must be performed with the hardware from related operations which can be performed later in software. An example would be that a driver for an ethernet buffer, when the hardware IRQ is triggered, really must be performed in an atomic manner; however, if the checksum for traffic is not in hardware, then instead of putting the checksum code in the hardware driver, a separate software driver would be implemented, and the hardware driver would trigger a software IRQ. It might also be possible that if the checksum is in hardware, then perhaps this would be in two separate drivers, and each would be triggered by the hardware IRQ wire (each half could theoretically then run on separate cores).

I believe this is still the case with Orins: much of the hardware has interrupt wiring only for CPU0 (shown as CPU1 in jtop/htop). Every CPU of course has its own timers and memory controller access, and perhaps groups of GPIO could be moved to some other core, but for the most part hardware interrupts only have wiring for CPU0. Take a look at the output of:
cat /proc/interrupts

Those are hardware IRQs. Not everything running on CPU0 must be on CPU0 (the scheduler might not know to offload some software to other cores), but much of it must be on CPU0 due to lack of hardware IRQ wiring (I hope someone from NVIDIA might be able to say which hardware can go to any core).

For the purpose of efficiency, if you know of a software process running on CPU0, you might want to set its affinity to another core. If the software runs entirely on one core, then there shouldn’t be any performance hit from cache misses (except perhaps the first time the process runs, or if other processes are competing for cache).
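As a sketch of that (the program name, PID, and core number here are arbitrary examples), taskset can either launch a program pinned to a core or move an already-running PID:

# Launch a program restricted to core 3:
taskset -c 3 ./my_worker
# Move an already-running process (PID 12345) to core 3:
taskset -cp 3 12345
# Confirm its current affinity:
taskset -cp 12345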

Note that if you set a hardware IRQ to a core which cannot service it due to wiring, the scheduler will still try to send it there. You might see the IRQ on that core, but then you’ll also see it back on CPU0, because the scheduler eventually recognizes that the other core isn’t available.
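For hardware IRQs the knob is the smp_affinity file under /proc/irq; the IRQ number below is hypothetical, and as noted above, on a Jetson the write may either fail or the IRQ may simply snap back to CPU0 if the wiring does not allow the move:

# Find the IRQ number for the ethernet controller in /proc/interrupts, e.g. 123,
# then read its current CPU mask (bit 0 = CPU0, bit 1 = CPU1, ...):
cat /proc/irq/123/smp_affinity
# Attempt to steer it to CPU1 (mask 0x2); check /proc/interrupts afterwards
# to see whether the counts actually move:
echo 2 | sudo tee /proc/irq/123/smp_affinity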

Each hardware IRQ takes a certain amount of time. As CPU0 load goes up, and as more hardware IRQs hit that core, the CPU will eventually not have time to service all IRQs. This is known as “interrupt starvation”. About the only solution is to try to put all of the software drivers onto a different core. This might include rewriting an inefficient hardware IRQ handler and splitting it into a mix of hardware IRQ and software IRQ, with the software IRQ going to another core (which could be inefficient as far as cache goes, but it is better than completely losing a hardware IRQ).
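One existing kernel mechanism along these lines (no driver rewrite required) is Receive Packet Steering, which lets the software half of receive processing run on other cores while the hardware IRQ stays where it is. Whether it helps with this particular driver is untested here, and the interface name and mask are assumptions:

# Allow receive soft IRQ work for eth0's queue 0 on cores 1-7 (hex mask 0xfe):
echo fe | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus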

I don’t think the hardware IRQ for ethernet can be offloaded to a different core. Someone from NVIDIA would have to say if that is possible (it might be now, but it wasn’t the case on earlier Jetsons).


Thank you for your reply. The pointers you have written do make sense. I would definitely wait for someone from Nvidia to help solve this. At the same time, I would like to implement the load-balancing and offloading from CPU0, so that I can keep the communication stable.

Do you happen to know of a way on NVIDIA devices to specify that the applications being run should not use CPU0, but only the other 7 cores?

Hi,
If you use Ethernet, please try this:
How to remove r8169 linux kernel module - #3 by DaveYYY

It seems to be a certain issue in the r8169 driver. Please use r8168 and give it a try.
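If it helps, a quick way to confirm which driver is currently bound to the port before and after the change (assuming the interface is eth0):

# Shows the driver name and version bound to the interface:
ethtool -i eth0
# Shows whether the r8168 or r8169 module is loaded:
lsmod | grep -E 'r8168|r8169'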

I would first try what @DaneLLL suggests. It is possible that the other driver is optimized differently, or just changes timing such that it is more efficient under load.

In terms of changing which core the ethernet runs on, I don’t think it can be done (I believe that if the command forcing it to another core succeeds, the scheduler will just move it back to the original core after checking the desired core and finding it cannot go there; it wouldn’t hurt to try, though).

For other software which is not dependent on CPU0, if you find something with significant CPU load, then you can look up the topics of “taskset” and “CPU affinity”. See:
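Regarding keeping applications off CPU0: as a rough sketch (the application name is hypothetical, and the numbering here is the Linux numbering in which the interrupt-handling core is CPU0), you can wrap the launch in taskset, or for a systemd service use the CPUAffinity directive:

# Launch an application on cores 1-7 only, leaving CPU0 for IRQ handling:
taskset -c 1-7 ./my_application
# Or, for a systemd service, add this to the unit file and reload:
# [Service]
# CPUAffinity=1 2 3 4 5 6 7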

Hi DaneLLL,
It looks like the frequency of the error has reduced after the fix that you shared, but it is still present and is causing a loss of communication. When I check my dmesg logs, I get this error:

Hi,
Please share the method so that we can set up and try to reproduce the error. The developer kit for Orin NX is Orin NX module + Orin Nano carrier board. We can set up this platform and try.

Hello DaneLLL,
I have seen this error occur when ~10 MB of data was being sent and ~45 MB of data was being received. The Orin was doing its normal operation (CPU load ~60%, temperature ~35 °C) and communicating with other Orins on the network.
I saw this error on this forum post as well - “NETDEV WATCHDOG: eth0 (r8168): transmission queue 0 timed out” and upstream kernel

Hi @sanket193
We would need your help with clear steps for replicating the issue on the developer kit, so that we can set up and try.

We have been testing two environments: a custom PCB and an Orin NX + Xavier devkit.
After changing the driver to r8168, the loss of communication has reduced significantly on the Orin NX + Xavier devkit combination.

The custom PCB, however, still frequently sees the loss of communication from the Orin, with multiple errors on the r8168 driver. There is possibly some error in the interaction of the Orin NX with the PCB. I will be able to share more details shortly.

Hi, given the difference in failure rate between your board and the devkit, it looks more like your custom PCB layout quality is not good enough. Can you share the schematic of this part of your board? You should also check your PCB routing against the chapter Ethernet MDI Routing Guidelines in the Orin NX Design Guide doc.
