Heavy network traffic causes high CPU usage in ksoftirq

During some testing, I’ve noticed something with my TX2 under heavy Ethernet load: the ksoftirq process goes nuts and uses 100% CPU time. An easy way to trigger this seems to be starting an iperf server on the TX2 and connecting with a Gig-E laptop.

I’ve seen this behavior on L4T r28.1 and r32.1. Enabling irqbalance seems to help a little on r32.1, but I still see network stalls and a large number of interrupts reported in /proc/softirqs. Irqbalance and r28.1 seem to be a bad combination: CPU load is still higher, and “Ethernet interrupt while in poll!” is spammed to both the console and the system log at around 110 Hz.
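
For anyone wanting to check interrupt placement themselves, something like the following shows which core is servicing the Ethernet interrupt and lets you pin it by hand. The IRQ number 45 is just a placeholder; look up the real one in /proc/interrupts first:

$ grep -i eth /proc/interrupts                   # per-CPU interrupt counts for the Ethernet controller
$ echo 4 | sudo tee /proc/irq/45/smp_affinity    # pin the (placeholder) IRQ to CPU2; the value is a hex CPU bitmask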

While talking with my contact at ConnectTech, they suggested something I’d like to try: using ethtool to increase the network card’s TX/RX ring buffers. Apparently this has helped in other applications. However, the current eqos driver for the onboard Ethernet device does not appear to support this? They suggested a different driver was used in past versions of L4T and did support it?
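
For reference, the ethtool invocations I have in mind look roughly like this (the interface name eth0 is an assumption, and the eqos driver may just answer “Operation not supported”):

$ ethtool -g eth0                         # query current and maximum RX/TX ring sizes
$ sudo ethtool -G eth0 rx 4096 tx 4096    # attempt to raise them; the values are only examples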

Any thoughts, or things I can try to get network performance a little more reliable?

Hi,

Could you share the data you have observed and the method to reproduce this issue?

Though I initially noticed it while having the TX2 act as a network server, it seems easiest to stress this with iperf. On the TX2, run:

$ sudo apt-get install iperf
$ iperf -s

Then from the client, run:

$ iperf -c [tx2 IP address]

Then run “top” in a second console on the TX2. The ksoftirqd/0 process should spike to 100%.
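
To watch the softirq side more directly than top, the following also works (mpstat comes from the sysstat package; I’m assuming the package name is the same on L4T’s Ubuntu base):

$ watch -n 1 cat /proc/softirqs    # per-CPU softirq counters; the NET_RX row climbs during the iperf run
$ sudo apt-get install sysstat
$ mpstat -P ALL 1                  # the %soft column shows softirq CPU time per core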

Another interesting datapoint: I have my TX2 mounted on a ConnectTech Elroy board, and needed to install a second network adapter for increased network bandwidth. This adapter is a Syba half-size mPCIe Ethernet adapter using a Realtek R8168 chipset. Not only does the Realtek adapter show a higher transfer speed in iperf, but there’s no apparent CPU spike in ksoftirq.

Fair warning that I am using a kernel compiled from the ConnectTech BSP sources. However, the ConnectTech guys felt this might be an upstream issue, and that I should ask here.

Some kernel design information might be useful…

The kernel scheduler determines which process gets what time and which CPU core to use; the same basically applies to running drivers. The drivers themselves probably have certain pieces of code which talk directly to the hardware, and there is no avoiding locking a core to that task during that time. For example, I/O to a physical address on the hardware might only be valid at certain times on a shared bus. The kernel must use atomic/non-divisible operations during that time, and preemption will not be possible.

However, there are times when things are done in the kernel which do not require locking the CPU core to a given function. When that code runs, scheduling becomes much more like user space, and the ksoftirq management software can migrate that code and run it based on fair sharing of time. Atomic operations are not required for this. Ksoftirq time indicates that purely software operations, or operations which can be preempted, are sharing cores (this is a “good thing”).
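
As a concrete illustration, that deferred work is serviced by one ksoftirqd kernel thread per CPU core, which you can list (the exact output format depends on your ps version):

$ ps -eo pid,psr,comm | grep ksoftirqd    # psr is the core each ksoftirqd/N thread is currently on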

A bad kernel driver will do both mandatory atomic operations and any related operations without ever unlocking the CPU core. A good kernel driver design will put the atomic code in one place, and the code which can be preempted somewhere else. Then mandatory locking will occur only for the part of the code which needs this (someone must know this and program accordingly), and code which can share time and not corrupt or fail with multitasking will be handled separately with ksoftirq. During the time of mandatory locking nobody else can use that core and the code will not share (the system will become less responsive…it’s good to have more CPU cores for that case, but cache misses mean extra cores are still not as good as sharing of a core).

Seeing a large amount of traffic resulting in ksoftirq running says the authors had a good driver design for that hardware (someone separated atomic from sharable operations). However, differences in ethernet hardware do exist. Not all ethernet hardware has the same features. In some cases some hardware actually offloads work to the ethernet chipset and the kernel does not need to do that work. Other hardware may depend on the CPU for parts of the work. An example would be hardware compression versus software compression, and some (but not all) chipsets will support hardware compression…hardware compression would be lower latency and lower CPU load, but the work would still be done…only that work would be done in the ethernet hardware instead of CPU. This latter case would not show CPU load and would avoid ksoftirq, but the reason would not be due to bad design: The reason would be due to non-CPU hardware performing the same task, and this is a “good thing” if the compression hardware exists (I don’t know what the existing hardware supports, this is just a contrived example).

If an ethernet driver were to do all of the work in atomic/non-preemptable fashion, but not need to do so, then you would also see a lack of load on ksoftirq. However, this would be bad design. If code can be offloaded to ksoftirq, but is not offloaded, then the CPU core is locked longer than needed and unable to share or multitask for excessive lengths of time. The work would still be done, but the work would be done for that driver and would not show up under ksoftirq.

In your case there are these possibilities:

  • The ethernet hardware is offloading work, so ksoftirq is simply not needed on the adapter that shows no ksoftirq load (the offload settings of the two adapters can be compared with ethtool; see the example after this list).
  • The driver that shows no ksoftirq load is failing good design, and the work is being performed atomically within the driver: the side effect is a less responsive system, with the load showing up somewhere other than ksoftirq.
  • Something in the design or operation of the low softirq case does not require the work which ksoftirq was doing for high ksoftirq load. This is possible and not unreasonable in cases where perhaps different network settings are being used (the hardware and drivers could be the same, but the data and parameters being processed differ). A contrived example might be one system is using jumbo frames (low load) and the other is not (high load)...it would appear to be the driver's or hardware's fault, but in reality it would be a case of different loads from different data handling and the observer not knowing about the jumbo frame differences.
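
One way to start narrowing down which of these applies is to compare what each adapter offloads to hardware. The interface names below are assumptions; substitute the actual names from “ip link”:

$ ethtool -k eth0    # offload features reported by the onboard eqos interface
$ ethtool -k eth1    # offload features reported by the Realtek adapter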

Hi linuxdev,

Thanks for the response, though it may have been less helpful than you intended.

What I’m after is a way for my TX2 to receive simultaneous high-throughput streams via NFS, and the TX2’s onboard Ethernet sustains measurably less throughput in my application. Understanding apparent driver quality doesn’t get me any closer to determining whether or not I can do anything about the asymmetric performance. I do understand that some boards have internal bus limitations (some ODroids, for example), and if that’s the case then so be it.

On the chance that the ksoftirq load is down to an unnecessary spinlock somewhere, it’s worth asking.

What has helped in other applications has been to bump the TX/RX buffers to decrease the interrupt rate. Understood that, depending on the Ethernet hardware, this may not be possible with this controller’s feature set. It may not even help much if the long pole in the tent is something like a long-running PIO transfer.
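
If the eqos driver really doesn’t expose the ring sizes, interrupt coalescing is another ethtool knob aimed at the same interrupt-rate problem; whether this driver supports it is an open question (the interface name is again an assumption):

$ ethtool -c eth0                       # query the current interrupt coalescing settings
$ sudo ethtool -C eth0 rx-usecs 250     # ask the NIC to batch RX interrupts; the value is only an example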