@linuxdev Thank you for taking the time to reply!
How do you mean “priority 49”? If you use “nice” or “renice” to change a priority, then the range is -20 to +19, and more positive means lower priority. What is your exact method of setting the priority?
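(Aside: nice/renice map to setpriority(2) in C. A sketch of that mechanism, for contrast with the real-time call described in the reply just below; it is not what we actually use:)

```c
/* Sketch of the "nice" mechanism: setpriority(2), the C counterpart
 * of the renice command. Not our actual method; shown for contrast. */
#include <sys/resource.h>
#include <stdio.h>

int main(void)
{
    /* who = 0 -> calling process; -20 is the most favorable nice value */
    if (setpriority(PRIO_PROCESS, 0, -20) == -1) {
        perror("setpriority");   /* typically EPERM without CAP_SYS_NICE */
        return 1;
    }
    return 0;
}
```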
We use sched_setscheduler(0, SCHED_FIFO, &param) with a priority of 49, which shows up as -50 in the PRI column of htop. The behavior is the same regardless of the priority value, even at the default priority.
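For reference, a minimal sketch of that call (error handling trimmed; the process needs CAP_SYS_NICE or root for SCHED_FIFO):

```c
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param param = { .sched_priority = 49 };

    /* pid 0 = calling process; SCHED_FIFO runs until it blocks or yields */
    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
        perror("sched_setscheduler");   /* typically EPERM without privilege */
        return 1;
    }
    return 0;
}
```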
Also, you should probably know a bit about how hardware interrupts are routed to CPU cores. What is the isolated core you speak of? Is this one of the Jetson cores, set via affinity? If so, then this is likely not working the way you think it is.
We isolate core 3 and use sched_setaffinity() to tie the process to the core. htop reports that only this process is running on core 3.
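The pinning itself, as a sketch (again, pid 0 means the calling process):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(3, &set);                 /* allow core 3 only */

    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}
```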
On a desktop PC, if you set affinity for a hardware device that is serviced by a hardware interrupt request, then an actual wire is configured to associate that interrupt with the core. I will emphasize: this is a physical wire. On an Intel PC, IRQ routing changes are programmed via the I/O APIC (Advanced Programmable Interrupt Controller); AMD has its own way of doing the equivalent. The Jetson has no such mechanism.
It’s not an interrupt caused by a physical wire.
Note: /proc/interrupts is not a real file; it is a live reflection in RAM, produced by a driver in the kernel, and it updates as activity runs. When you use “cat” or “less” on that file you freeze a snapshot, but the underlying data is still changing.
There are also software interrupt requests: an IRQ is either a hardware IRQ or a soft IRQ. Software IRQs have a uniform mechanism across architectures, the ksoftirqd daemon (which works closely with the scheduler). Hard IRQs also talk to the scheduler, but on a Jetson many of the hardware IRQs can only go to the first CPU core (CPU0), because there is no I/O APIC (or equivalent) on the Jetsons. Most hardware IRQs must go to CPU0, and all of those hard IRQs compete with each other.
You can set affinity for something that depends on a hardware IRQ to a different core, but what you’ll actually get is the scheduler reassigning it to CPU0. Take a look at “/proc/interrupts”; this file covers hardware IRQs. A few items, for example timers, have access to every core. I don’t know whether a PCIe device would have access to different cores, but you will find various PCI interrupts listed in /proc/interrupts. Is that IRQ going strictly to the core you expect? It is possible for an IRQ to be pointed at a core and then transferred back to CPU0. You might also look near the bottom of that file and examine the “Rescheduling interrupts” line item.
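A quick way to check that, beyond eyeballing the file: sample /proc/interrupts twice and see which CPU column is counting up for the IRQ in question. A small C sketch (the “eth0” pattern is an assumption; substitute your device’s line):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Print the /proc/interrupts lines matching a pattern; one count column
 * per core. A thin C equivalent of `grep eth0 /proc/interrupts`. */
static void dump_matching(const char *pattern)
{
    char line[512];
    FILE *f = fopen("/proc/interrupts", "r");

    if (!f)
        return;
    while (fgets(line, sizeof(line), f))
        if (strstr(line, pattern))
            fputs(line, stdout);
    fclose(f);
}

int main(void)
{
    dump_matching("eth0");            /* assumed device name */
    sleep(1);
    puts("--- one second later ---");
    dump_matching("eth0");            /* the column that grew is the servicing core */
    return 0;
}
```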
What we did to optimize performance was to set pci=nomsi on the kernel command line, so that we could set the smp_affinity of tegra-pcie-intr, PCIe PME, and eth0 to core 3. We also disabled any other service that would use eth0. Watching /proc/interrupts with watch -n 0.1 "cat /proc/interrupts | grep eth0", the CPU columns show no interrupts happening on any core other than 3. Also, no interrupt occurs when the ethercat service is not running. The problem occurs even if we don’t change the interrupt affinity, and even without the pci=nomsi kernel option. But it does not occur on the LAN7430 that is connected to the Orin.
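For completeness, the affinity change itself boils down to writing a CPU bitmask to the IRQ’s smp_affinity node. A sketch, with the IRQ number 123 as a placeholder (the real numbers come from /proc/interrupts); it is equivalent to echo 8 > /proc/irq/123/smp_affinity:

```c
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/irq/123/smp_affinity", "w");  /* hypothetical IRQ */

    if (!f) {
        perror("fopen");
        return 1;
    }
    /* bitmask: bit 3 set -> 0x8 -> CPU3 only */
    fprintf(f, "8\n");
    if (fclose(f) != 0) {   /* stdio flushes here; the kernel rejects an
                               unroutable IRQ with EIO at this point */
        perror("fclose");
        return 1;
    }
    return 0;
}
```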
The usual advice for writing a Linux hardware driver is to perform only the bare minimum of work in the hardware IRQ handler, and, if more is required, to spin the remainder off as a software IRQ handled elsewhere in the driver. For example, a network adapter might need a checksum; if the checksum is not performed by the adapter itself, then you would want to move that checksum routine out of the hardware IRQ path and trigger it from software. That software interrupt can run on any core, which shortens the time the original hardware IRQ locks a core. This wouldn’t matter as much on a system with an I/O APIC (or similar), but it would still break up core locking to a finer degree.
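A generic sketch of that split, assuming a workqueue for the deferred half (out-of-tree drivers cannot register raw softirqs; tasklets, threaded IRQs, or workqueues are the usual deferral mechanisms). Everything here (the module, the IRQ number, the names) is hypothetical, not the nvethernet driver’s actual code:

```c
#include <linux/module.h>
#include <linux/interrupt.h>
#include <linux/workqueue.h>

static struct work_struct csum_work;
static int irq = 200;                 /* hypothetical IRQ number */
module_param(irq, int, 0444);

/* bottom half: heavy work (e.g. checksumming) runs later, in process
 * context, schedulable on any core */
static void csum_work_fn(struct work_struct *work)
{
}

/* top half: acknowledge the device and defer; keep it short, because
 * this handler blocks the core it runs on */
static irqreturn_t nic_irq(int this_irq, void *dev_id)
{
    schedule_work(&csum_work);
    return IRQ_HANDLED;
}

static int __init split_init(void)
{
    INIT_WORK(&csum_work, csum_work_fn);
    /* IRQF_SHARED requires a non-NULL dev_id cookie */
    return request_irq(irq, nic_irq, IRQF_SHARED, "hypothetical-nic",
                       &csum_work);
}

static void __exit split_exit(void)
{
    free_irq(irq, &csum_work);
    cancel_work_sync(&csum_work);
}

module_init(split_init);
module_exit(split_exit);
MODULE_LICENSE("GPL");
```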
Maybe there’s something there.
I don’t know if the RPi has the ability to route hard IRQs to different cores (I don’t have an RPi, so I have never looked), but if you have a similar setup using the same NIC and/or other communications hardware, you could examine that platform’s “/proc/interrupts”. Maybe it is no different regarding hard IRQ access to cores, but if it does differ, then you won’t be able to use affinity the same way on the Jetson. If it turns out they are the same, then there is likely some other optimization (Jedi mind trick?) you can use to improve things.
I use the same smp_affinity technique on the RPi. However, the RPi’s eth0 is tied to two IRQs, and I set both of those IRQs to core 3. Changing the affinity is only meant to help with jitter.
I’ve tried to read a lot on the subject and have banged my head trying different combinations of solutions, but after making it work on other hardware, I am out of ideas for what to try next. Maybe there’s a kernel option that could help, but I think the Ethernet port is managed by nvethernet.ko, and I would like to know what NVIDIA thinks of this behavior.