IRQ Balancing

Hi everyone,

I'm using a Jetson Xavier AGX dev kit.
I plugged in a PCIe card to add 4 more Ethernet ports.
When using all 4 ports, I get lots of interrupts.
All interrupts seem to be handled by a single CPU core.
So I tried irqbalance to spread my 4 Ethernet interfaces across the 8 cores,
but it doesn't do anything. Do you have any idea how I can put one Ethernet port's interrupts on core 2, another on core 3, and so on?

Thank you !

On an Intel desktop PC there is an “IO-APIC” to distribute hardware interrupts. ARM CPUs differ in architecture from Intel PC CPUs, and can handle hardware interrupts with different methods. On the Jetson I think you will find some hardware is available on all cores (such as the memory controller), but many hardware interrupts cannot be offloaded to other cores. The best you can do is to use CPU affinity to force other processes away from CPU0 (the first core), or perhaps to raise the priority of some code or reduce the priority of other code.
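
As a rough sketch of that last idea (the PID and program name below are placeholders, not anything from your system), the CPU affinity of an ordinary process can be changed with taskset:

# Show the current affinity of a process (replace 1234 with a real PID)
taskset -pc 1234

# Restrict that process to cores 1-7, keeping it off CPU0
taskset -pc 1-7 1234

# Or launch a program already pinned to a single core, e.g. core 3
taskset -c 3 ./my_program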

Someone else may know more about this and have other means of tweaking performance. There may have been improvements since I last looked at this.

Hi,

Thanks for your reply !

Do you know where I can find documentation about hardware interrupts on the Jetson Xavier?

Concerning your suggestion about CPU affinity, I already tried it, but I'll investigate further.
What I saw in htop is that CPU#0 is 100% used by kernel processes, with more than 50% used by ksoftirqd.
top confirmed that hi (hardware interrupts) and si (software interrupts) account for 100% of CPU#0.
So even distributing the other processes didn't change anything.
There must be a solution: CPU#0 is at 100% while the other 7 cores are below 40%.
I'm pretty sure that solving this would be of great value and would considerably increase the Xavier's performance!

Thanks again

Please try changing smp affinity using ‘/proc/irq/$IRQ/smp_affinity_list’ and share if interrupts are still not routed as set.
https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
You can get the IRQ number ($IRQ) from ‘/proc/interrupts’.
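
For example (just a sketch; core 2 and eth1 are placeholders for whatever you want on your own system). Note that ‘smp_affinity’ takes a hexadecimal CPU mask, while ‘smp_affinity_list’ takes a plain CPU list such as ‘2’ or ‘0-3’:

# Find the IRQ lines used by the interface, e.g. eth1
grep eth1 /proc/interrupts

# Route one of those IRQs to core 2 ($IRQ is the number found above)
echo 2 > /proc/irq/$IRQ/smp_affinity_list

# Verify it
cat /proc/irq/$IRQ/smp_affinity_list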

Thanks !

I got this:

root@nvidia-desktop:/home/nvidia# echo 0f > /proc/irq/820/smp_affinity_list
bash: echo: write error: Input/output error

It seems we cannot do that.
That's strange, because I'm logged in as root, so it should be OK.

Francois

It might be something specific to that IRQ. Could you try another IRQ?
I am able to change the affinity for the one below.
root@tegra-ubuntu:/home/ubuntu# echo 80 > /proc/irq/478/smp_affinity
root@tegra-ubuntu:/home/ubuntu# cat /proc/irq/478/smp_affinity
80

It works for me on irq 80 also.

I got some answers.

  40:     564819      11302          0          0          0          0          0       5572     GICv2  226 Level     ether_qos.common_irq
  42:     280932       5523          0          0          0          0          0        850     GICv2  222 Level     2490000.ether_qos.rx0
  43:    1004400       9987          0          0          0          0          0        443     GICv2  218 Level     2490000.ether_qos.tx0

It works on the IRQs above. It seems to be because they are GICv2 interrupts, so not virtual IRQs.

 819:          1          0          0          0          0          0          0          0   PCI-MSI    0 Edge      eth1
 820:       9009         0          0          0          0          0          0          0   PCI-MSI    1 Edge      eth1-TxRx-0
 821:       9225         0          0          0          0          0          0          0   PCI-MSI    2 Edge      eth1-TxRx-1
 822:       8535         0          0          0          0          0          0          0   PCI-MSI    3 Edge      eth1-TxRx-2
 823:      26194         0          0          0          0          0          0          0   PCI-MSI    4 Edge      eth1-TxRx-3
 824:     763077       0          0          0          0          0          0          0   PCI-MSI    5 Edge      eth1-TxRx-4
 825:       8523         0          0          0          0          0          0          0   PCI-MSI    6 Edge      eth1-TxRx-5
 826:       8520         0          0          0          0          0          0          0   PCI-MSI    7 Edge      eth1-TxRx-6
 827:   83353210     0          0          0          0          0          0          0   PCI-MSI    8 Edge      eth1-TxRx-7

It doesn't work on these ones because they are PCI-MSI (so, if I understand correctly, virtual IRQs derived from the parent device's IRQ).
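
One quick way to double-check which interrupt chip is behind a given IRQ (a sketch; 43 and 820 are just numbers from the tables above) is the chip column of /proc/interrupts; recent kernels also expose it under /sys/kernel/irq:

# The chip column (GICv2, PCI-MSI, ...) shows whether a line is a wired GIC interrupt or an MSI
grep -E '^ *(43|820):' /proc/interrupts

# If your kernel exposes it, the per-IRQ sysfs entry names the chip as well
cat /sys/kernel/irq/820/chip_name 2>/dev/null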

So I tried to change the affinity of the PCIe IRQ itself, and then I got this:

 819:          1          0          0          0          0          0          0          0   PCI-MSI    0 Edge      eth1
 820:       9009         33          0          0          0          0          0          0   PCI-MSI    1 Edge      eth1-TxRx-0
 821:       9225         30          0          0          0          0          0          0   PCI-MSI    2 Edge      eth1-TxRx-1
 822:       8535         30          0          0          0          0          0          0   PCI-MSI    3 Edge      eth1-TxRx-2
 823:      26194         92          0          0          0          0          0          0   PCI-MSI    4 Edge      eth1-TxRx-3
 824:     763077       2845          0          0          0          0          0          0   PCI-MSI    5 Edge      eth1-TxRx-4
 825:       8523         30          0          0          0          0          0          0   PCI-MSI    6 Edge      eth1-TxRx-5
 826:       8520         30          0          0          0          0          0          0   PCI-MSI    7 Edge      eth1-TxRx-6
 827:   83353210     307437          0          0          0          0          0          0   PCI-MSI    8 Edge      eth1-TxRx-7

The issue with that is that I can't separate the IRQs of the 4 PCIe cards plugged (through a splitter) into the devkit's PCIe slot.

Thanks for your help.

It depends on what is running on CPU0. Anything requiring hardware access will have to remain there, but often the CPU0 driver will spawn side work to ksoftirqd. This can then migrate to other cores, while the hardware-dependent code cannot. Having 100% on CPU0 is not wrong unless it is from processes which could be elsewhere.

@sumitg mentioned “/proc/interrupts”. You'll notice that these are hardware interrupts. If you look at the name in the right-most column, these are actual hardware devices, often named with a physical address if there are multiple copies of a controller. For example, each i2c controller is listed, and the address is part of the prefix. Every core has timers directly wired, and so interrupts will be seen on each core for timers. Other than this, pretty much every IRQ count given is under CPU0. If this were an Intel CPU on a PC, then you'd see this distributed more because of the programmable IO-APIC.

What you really need, but what I do not have enough knowledge of, is to profile the time spent under CPU0 hardware IRQs to find out what is using the most time, and then to look at whether the time-consuming IRQs really need to do all of their work under CPU0. My guess is that the drivers for most of the hardware are already very highly tuned, and that only custom drivers might need to be optimized to offload some of the work to ksoftirqd instead of doing it all on CPU0.
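
As a rough starting point (just a sketch; mpstat comes from the sysstat package and perf from the linux-tools packages, so both may need to be installed):

# %irq and %soft columns show, per core, the time spent in hard IRQ and softirq handling
mpstat -P ALL 1

# Live view of which kernel/driver symbols CPU0 itself is spending its time in
perf top -C 0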

The work on CPU0 which is not from a hardware IRQ will not be listed in /proc/interrupts. All of that work could be moved to other cores to reduce the load competing with hardware I/O, but this is not something you can easily just do and be done with…doing this well would take a lot of time and experimentation.

If you look towards the bottom of /proc/interrupts you will also see interrupt rescheduling, which might be a case of higher priority interrupts preempting lower priority interrupts. You will normally see a lot of these, and there is no direct way to determine what is “too many”, but if you watch a fully loaded system which is running well (e.g., without your tasks running), you can get an idea of how fast the count normally goes up. A command like this:
watch -n 1 -x egrep '(CPU0|IPI[0-9][:])' /proc/interrupts

Now if you can look at this and have a feel for how fast things “normally” change (a bit like the guy in the Matrix movie reading the data directly from the glyphs on the screen), you could run your code and see if rescheduling goes up a lot. Not a very technical way of doing it, but if there is too much rescheduling this might be a case of IRQ starvation.

If there is starvation, perhaps there is a way to give your driver a higher priority, but I don’t think I can help with that.

If you want to add the IRQ number for a set of IRQs to the watch command, it would go something like this (I’m randomly picking IRQs, one is for ethernet and another is for a USB controller):

watch -n 1 -x egrep '(CPU0|IPI[0-9][:]| 41[:]| 21[:])' /proc/interrupts

(the change is that I added " 41[:]| 21[:]" to see IRQs 21 and 41).

That particular sample is interesting because it is IRQ traffic from mmc0 and ethernet…typically these can require a lot of servicing if traffic is heavy.

Non-hardware-IRQ traffic will be unrelated to “/proc/interrupts”. Intentional migration of affinity is almost always from purely software processes which do not require direct hardware access. These are the ones you have a lot of control over, and if these are on CPU0, then it might be a good idea to force these software processes somewhere other than CPU0 (this still wouldn’t matter if you are not approaching IRQ starvation, although any movement would probably reduce hardware driver latencies).

Note that ksoftirqd is relatively smart and tends to use those other cores fairly well. You probably don’t need to interfere with those processes unless you run into some specific odd condition. If you see the process in “htop” or “top” or “ps”, then you will be interested in looking at setting affinity to a non-CPU0 core. If you have a single critical process, then perhaps you might assign that and only that to a specific core. It’s a lot of art and experimentation.

Thanks for your very interesting answer.
I'll try out your ideas and report back in this thread.

Francois

Here is an update on what I did.
I managed to get a functional setup.

I just set the affinity of the PCIe IRQs to CPU#1.
CPU#0 is now largely offloaded, because PCIe handles the 4 Gigabit Ethernet ports.
Now I can even add a 5th Ethernet port (with a USB adapter).
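
For reference, the change itself is just this kind of thing (a sketch; the grep pattern is only a guess at how the controller is named, and the IRQ number must be taken from /proc/interrupts on the board):

# Find the wired (non-MSI) interrupt line of the PCIe controller
grep -i pcie /proc/interrupts

# Pin that parent IRQ to core 1, taking the Ethernet interrupt load off CPU0
PCIE_IRQ=39   # placeholder: use the number found above
echo 1 > /proc/irq/$PCIE_IRQ/smp_affinity_list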

What I still don't understand is why irqbalance doesn't do that by itself!

As a next step, which I will try later, I would like to see if there is a way to handle each PCIe endpoint on a different CPU.
But according to what you explained and what I read on the internet, that seems to be tricky.

Thanks for your help !

Hmm, one last point I don't get: why is ksoftirqd not using the other CPUs?
When CPU#0 is at 100%, the work should go to the other CPUs, no?
This is not the case; CPU#0 is fully loaded, but ksoftirqd still runs on CPU#0, using more than 50% as shown in htop.

So I think it's because CPU#0 is full and cannot handle more interrupts, so it delegates the management of all the remaining interrupts to ksoftirqd, no?
Then why, when CPU#0 is already at 100%, is ksoftirqd not run on CPU#1? Or 2? Or …?
In htop I see a ksoftirqd process for every CPU, but only the one on CPU#0 is used.

The scheduler may not know enough about the hardware, and especially may not know what your priorities are.

The scheduler determines where soft interrupts go, but the scheduler may purposely try to keep interrupts on the original core in order to take advantage of cache hits. Know that “/proc/interrupts” does not show soft interrupt activity.
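
The soft interrupt side has its own per-CPU counters in /proc/softirqs, if you want to watch it (a sketch):

# Per-CPU counters for each softirq class (NET_RX, NET_TX, TIMER, SCHED, ...)
watch -n 1 cat /proc/softirqs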

Also know that if a driver runs on CPU0 in order to service a hardware interrupt, then any ksoftirqd work which results from that driver splitting off part of its workload also originates on CPU0 (see below about the scheduler trying to preserve cache hits and avoiding a move to a new core when the cache could miss). One driver can spawn work for another. In your case you are concerned with hardware IRQ drivers splitting work into both hardware IRQ and software IRQ workloads.

CPU0 can only offload software IRQs to other cores for content not requiring direct hardware access to a physical address. The other cores on an Intel based machine with an IO-APIC are able to reach physical addresses of hardware, but without this, the ARM cores which are not specifically wired to be able to reach those devices will send the work back to CPU0 anyway if they need an unreachable address…or there will be an error. This is one reason why CPU0 is used for all initial boot content.

CPU0 being overloaded is IRQ starvation. This implies the need to offload to other cores. This cannot happen if CPU0 is the only core wired to reach certain physical addresses. Software offload is often a case of someone tuning for this, otherwise it will just use the scheduling of ksoftirqd, and this may not be smart enough for your situation.

I do not know whether each PCIe endpoint can be offloaded to other cores. Each core does have access to the memory controller (virtual memory is implied), and so I suspect this might be possible for operations going through the memory controller…but I do not know for certain.

One thing the scheduler is fairly smart about is knowing to not randomly switch cores since this would cause cache misses.

>> Why is ksoftirqd not using the other CPUs? When CPU#0 is at 100%, shouldn't the work go to other CPUs?
>> In htop I see a ksoftirqd process for every CPU, but only the one on CPU#0 is used.

The per-CPU ‘ksoftirqd/n’ thread is woken on the same CPU on which the hard IRQ is received.
So, setting the SMP affinity of an IRQ to a non-boot core should offload both the hard IRQ and the softirq processing to that core.
This is agnostic of the scheduler.

static void wakeup_softirqd(void)
{
	/* Interrupts are disabled: no need to stop preemption */
	struct task_struct *tsk = __this_cpu_read(ksoftirqd);

	if (tsk && tsk->state != TASK_RUNNING)
		wake_up_process(tsk);
}

OK, thanks.
I managed to offload enough for my purposes.
But I didn't quite reach what I wanted to do.

My 4-port Ethernet PCIe card is still handled physically by IRQ 39.
I see the IRQs of my 4 Ethernet interfaces like this:

 819:          1          4          0          0          0          0          0          0   PCI-MSI    0 Edge      eth1
 820:       9035     179129       2828          8          0          0          0       3056   PCI-MSI    1 Edge      eth1-TxRx-0
 821:       9301     278157       8570         24          0          0          0       9326   PCI-MSI    2 Edge      eth1-TxRx-1
 822:       8560     172094       2670          8          0          0          0       2897   PCI-MSI    3 Edge      eth1-TxRx-2
 823:      26269    1254696       8094         23          0          0          0       8781   PCI-MSI    4 Edge      eth1-TxRx-3
 824:     763158     906282       8986         25          0          0          0       9690   PCI-MSI    5 Edge      eth1-TxRx-4
 825:       8548     534360       2669          8          0          0          0       2895   PCI-MSI    6 Edge      eth1-TxRx-5
 826:     187001  423297513   27996295      78773          0          0          0   30383082   PCI-MSI    7 Edge      eth1-TxRx-6
 827:   83353235  202324415       2669          8          0          0          0       2895   PCI-MSI    8 Edge      eth1-TxRx-7
 828:          1          4          0          0          0          0          0          0   PCI-MSI    9 Edge      eth2
 829:     687796    4106510     225938        636          0          0          0     244956   PCI-MSI   10 Edge      eth2-TxRx-0
 830:      26523     669727       2669          8          0          0          0       2895   PCI-MSI   11 Edge      eth2-TxRx-1
 831:       8543     172227       2669          8          0          0          0       2895   PCI-MSI   12 Edge      eth2-TxRx-2
 832:   83318684 1369288545       2669          8          0          0          0       2895   PCI-MSI   13 Edge      eth2-TxRx-3
 833:       9076     186425       2823          8          0          0          0       3055   PCI-MSI   14 Edge      eth2-TxRx-4
 834:     204664  423313353   27995026      78752          0          0          0   30380740   PCI-MSI   15 Edge      eth2-TxRx-5
 835:       8541    9997382       2669          8          0          0          0       2895   PCI-MSI   16 Edge      eth2-TxRx-6
 836:       9325     194832       2934          8          0          0          0       3239   PCI-MSI   17 Edge      eth2-TxRx-7
 837:          1          4          0          0          0          0          0          0   PCI-MSI   18 Edge      eth3
 838:     223210  423418550   27992732      78761          0          0          0   30379702   PCI-MSI   19 Edge      eth3-TxRx-0
 839:     685392     923060      13955         39          0          0          0      15147   PCI-MSI   20 Edge      eth3-TxRx-1
 840:       9250     192793       2923          7          0          0          0       3206   PCI-MSI   21 Edge      eth3-TxRx-2
 841:   83340910 1369382796       2669          7          0          0          0       2895   PCI-MSI   22 Edge      eth3-TxRx-3
 842:      11232    4039953     233355        688          0          0          0     253257   PCI-MSI   23 Edge      eth3-TxRx-4
 843:       8545     171953       2669          7          0          0          0       2895   PCI-MSI   24 Edge      eth3-TxRx-5
 844:       8559   11912438       2669          7          0          0          0       2895   PCI-MSI   25 Edge      eth3-TxRx-6
 845:       8541     175656       2669          7          0          0          0       2895   PCI-MSI   26 Edge      eth3-TxRx-7
 846:          1          4          0          0          0          0          0          0   PCI-MSI   27 Edge      eth4
 847:     187469  422212460   27990913      78730          0          0          0   30376906   PCI-MSI   28 Edge      eth4-TxRx-0
 848:     667690    1611434       2669          7          0          0          0       2895   PCI-MSI   29 Edge      eth4-TxRx-1
 849:      26258     524508       8096         22          0          0          0       8780   PCI-MSI   30 Edge      eth4-TxRx-2
 850:       8547     171959       2669          7          0          0          0       2895   PCI-MSI   31 Edge      eth4-TxRx-3
 851:      30210    4131348     245018        721          0          0          0     266009   PCI-MSI   32 Edge      eth4-TxRx-4
 852:   83328386 1369247591       2669          7          0          0          0       2895   PCI-MSI   33 Edge      eth4-TxRx-5
 853:       8561     171960       2669          7          0          0          0       2895   PCI-MSI   34 Edge      eth4-TxRx-6
 854:      27315     473088       2935          7          0          0          0       3240   PCI-MSI   35 Edge      eth4-TxRx-7

So changing the affinity of IRQ 39 changes the affinity of all the Ethernet interfaces (as I guess it is the IRQ of the PCIe controller itself).
But I can't change them individually (eth1 on CPU1, eth2 on CPU2, …).

Does the Jetson SoC have an IRQ aggregator capable of offloading all hardware IRQs to a non-CPU0 core? I know that on Intel-based CPUs this takes extra hardware, and on many ARM-based multiprocessors hardware IRQ migration depends on hardware which not all ARM multiprocessors have.

Sorry for the late reply on this.
Jetson doesn't have an IRQ aggregator. In the ARM GIC, the affinity of SPIs (shared peripheral interrupts) can be changed to any CPU by changing the corresponding GIC Distributor register.

As far as the AGX's PCIe MSI interrupts are concerned, there is only one real/physical wire interrupt for MSI, and what we see in /proc/interrupts are virtual interrupts multiplexed on top of this one physical interrupt. So, affinity can be set for the physical line, but not for the individual virtual MSI interrupts.
