How do I configure IRQ affinity on the K1? The normal way in Linux is to write the CPU bitmask into /proc/irq/<irq>/smp_affinity (note: /proc/irq, not /proc/interrupts). This didn't work for me; the write is rejected as invalid (even as root) on the Jetson K1.
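For reference, here is a minimal sketch of that normal procedure; the device name "eth0" is an example, and the IRQ number is looked up rather than hard-coded since it differs per board:

```shell
#!/bin/sh
# Convert a CPU index into the hex bitmask smp_affinity expects.
cpu_to_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

# Look up a device's IRQ number in /proc/interrupts and write the
# mask for the given CPU. Needs root; "eth0" below is an example.
set_irq_affinity() {
    dev="$1"
    cpu="$2"
    irq=$(awk -v d="$dev" '$0 ~ d { gsub(":", "", $1); print $1; exit }' /proc/interrupts)
    [ -n "$irq" ] || { echo "no IRQ found for $dev" >&2; return 1; }
    # On the K1 this write is the step that fails: the interrupt
    # controller cannot route the hardware IRQ off CPU0, so the
    # kernel rejects the mask.
    cpu_to_mask "$cpu" > "/proc/irq/$irq/smp_affinity"
}

# Example (as root): pin eth0's IRQ to CPU1
#   set_irq_affinity eth0 1
```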
It might be that hotplug is turning the CPUs off before I can set the smp_affinity mask, but I don't know how to disable it short of a kernel rebuild.
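If hotplug interference is the suspicion, one thing worth trying before a kernel rebuild is forcing the cores online and disabling the cpuquiet hotplug driver from sysfs. This is a sketch assuming L4T-style paths; they vary between releases, so verify them on your board first:

```shell
#!/bin/sh
# Sketch, assuming the L4T cpuquiet hotplug driver on the Jetson TK1;
# sysfs paths vary between releases, so verify them on your board.

# Force the secondary cores online so the affinity write is not
# racing against hotplug taking a core back down.
for cpu in /sys/devices/system/cpu/cpu[1-9]*; do
    if [ -w "$cpu/online" ]; then
        echo 1 > "$cpu/online"
    fi
done

# If present, stop the cpuquiet governor from offlining cores again
# (this path is an assumption based on L4T releases for the TK1).
cq=/sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable
if [ -w "$cq" ]; then
    echo 0 > "$cq"
fi

# Confirm how many cores are now online.
grep -c '^processor' /proc/cpuinfo
```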
While I can’t guarantee this answer, it’s important to know that chipset differences between a desktop system and other architectures can get in the way of what you do with hardware IRQs.
Understand that on AMD desktop systems the architecture directly allows any CPU core to handle any IRQ; on an Intel desktop system there is an added hardware block, the Advanced Programmable Interrupt Controller (the I/O APIC), which allows routing any IRQ of any type to any core. Multi-core ARMv7 is similar to the Intel architecture, but the Tegra K1 lacks an equivalent of the APIC. This missing I/O APIC has no effect on non-hardware (software) IRQs, but it does mean that ONLY CPU0 can handle hardware IRQs (such as ethernet) on the Tegra K1.
It is possible there are other reasons why your attempt to assign the eth0 IRQ does not work, but a hardware interrupt in general can only go to CPU0. If there are non-hardware IRQs related to this, it may be possible to assign affinity to those. I have not tried to assign any software IRQ, so there may be other "hoops" to jump through even for software IRQs that I'm not aware of.
I am trying to attach an extra NIC to the board using the miniPCIe slot, as you guessed correctly, to increase throughput. The default software load generates a very high interrupt load on CPU0 with the onboard NIC alone. Adding a second NIC doesn't appear to increase throughput, and I suspect it's because CPU0 can't handle the additional interrupt load.
I also have a second NIC in the mPCIe slot, set up in bridging mode, using the same driver as the integrated NIC. I have not done any significant performance testing, but I noticed higher variance in ping times through this bridge compared to the same bridge on a desktop (there are all kinds of reasons besides this why the desktop performs better).
One thing I have considered but not yet done is to try to separate the parts of the ethernet driver that MUST run in the hardware IRQ from the parts that could run elsewhere; e.g., if the hardware IRQ handler is doing more work than strictly needed, part of it could be moved to user space or handed off to a software IRQ (which in turn could be handled on a different core, lightening the load on CPU0).
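On the "hand it off to a software IRQ on another core" idea: Receive Packet Steering (RPS) in mainline kernels (since 2.6.35) does roughly this for the receive path without any driver hacking. A sketch, assuming eth0 and a kernel built with RPS support:

```shell
#!/bin/sh
# Sketch: Receive Packet Steering (RPS) hands the protocol-level
# (softirq) half of receive processing to other cores while the
# hardware IRQ itself stays on CPU0. Assumes eth0 and a kernel with
# RPS support (mainline since 2.6.35).

# Hex mask covering CPUs 1..(n-1), deliberately leaving CPU0 out.
mask_without_cpu0() {
    printf '%x\n' $(( (1 << $1) - 2 ))
}

rps=/sys/class/net/eth0/queues/rx-0/rps_cpus
if [ -w "$rps" ]; then
    mask_without_cpu0 4 > "$rps"   # 0xe = CPUs 1-3 on a quad core
    cat "$rps"
else
    echo "RPS not available or not writable at $rps" >&2
fi
```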
A second consideration, if latency is to improve and the bottleneck is the hardware IRQ handler, is that any other software running on CPU0 competes for that core. Non-hardware IRQ software should already hand off properly to other cores; I do not believe this would require any code changes, as there is nothing to fix there. However, hardware IRQ handlers for other drivers could also be at fault for delaying the ethernet driver. Offloading everything possible from the ethernet hardware handler to a separate software handler only matters if it is the ethernet handler itself causing the latency. Competing software on CPU0 is either a question of scheduling or of the other software holding the core too long.
So…what is really needed is a way to profile CPU0 time spent in hardware IRQ handlers, and especially the time the ethernet hardware IRQ handler must wait because something else is holding the core. I do not have that information. A poor man's way of finding out part of this might be to disable as many hardware devices as possible, perhaps simply by unloading drivers built as modules; if ethernet improves significantly when one of those other drivers is removed, then you know that driver needs to be optimized.
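A rough sketch of that poor man's measurement, sampling /proc/interrupts to get a per-device interrupt rate (device names are examples; run it before and after unloading each suspect module):

```shell
#!/bin/sh
# Poor man's IRQ profiler: sample /proc/interrupts twice and report
# the per-second delta for one device. Column 2 is the CPU0 count,
# which on the K1 is where every hardware IRQ lands anyway.
irq_rate() {
    dev="$1"
    interval="${2:-1}"
    a=$(awk -v d="$dev" '$0 ~ d { print $2; exit }' /proc/interrupts)
    sleep "$interval"
    b=$(awk -v d="$dev" '$0 ~ d { print $2; exit }' /proc/interrupts)
    [ -n "$a" ] && [ -n "$b" ] || { echo "no such device: $dev" >&2; return 1; }
    echo $(( (b - a) / interval ))
}

# Example: interrupts/sec generated by eth0 over a 5 second window
# (try it during a flood ping, then again with other modules unloaded):
#   irq_rate eth0 5
```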
My case with two NICs running as a bridge is somewhat of a special case, as it guarantees the need to handle one device generates an immediate need to handle the second device.
If you look at /proc/interrupts, you can see cumulative IRQ counts; if you run "vmstat", the "in" column under "system" gives IRQs/sec. You'll notice in /proc/interrupts on the TK1 that the only per-CPU column is CPU0, whereas on a multi-core desktop (assuming it isn't booted with the noapic option) all cores are listed. The lack of any other column under ARMv7 suggests that only hardware IRQs are listed, but I can't confirm that. What I find interesting is that the TK1's vmstat "in" climbs high during a flood ping (ping -f), whether from desktop host to TK1 or from TK1 to desktop host. Even so, my quad-core AMD system (has nVidia chipset :) shows no climb in the "in" column during flood pings. I've been curious how the desktop might be combining or otherwise handling the ethernet IRQ without increasing latency (or perhaps I'm just reading vmstat wrong).
Is your second NIC using the same Realtek driver as the integrated NIC?
Interrupt handlers are usually split into a top half and a bottom half (that's the Linux terminology). The truly time-critical part of the interrupt is handled in the top half by the interrupt handler itself (copying data into a buffer, register banging, etc.). The bulk of the work is handed off to a high-priority kernel thread.
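Those bottom halves are visible as ordinary kernel threads, which is exactly what makes them schedulable on other cores. A quick way to list them (reading /proc directly, to avoid depending on any particular ps options):

```shell
#!/bin/sh
# The bottom-half workers are ordinary kernel threads: ksoftirqd/N
# runs deferred softirq work for core N, and drivers using threaded
# IRQ handlers appear as irq/<number>-<name>. Reading /proc directly
# keeps this portable across ps variants.
for d in /proc/[0-9]*; do
    comm=$(cat "$d/comm" 2>/dev/null) || continue
    case "$comm" in
        ksoftirqd/*|irq/*)
            echo "${d#/proc/}  $comm"
            ;;
    esac
done
```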
The mini PCIe NIC I have uses the same chipset as the integrated NIC, but I don't have it with me right now. So I am trying to move some or all of the interrupt processing to a CPU other than CPU0, to see if it is even possible without driver hacking. I haven't done driver hacking in a long time and would rather avoid it if I can.
IRQ affinity works fine on Intel i5 but I never got it to work with the Tegra 3. Now trying with K1.
We are doing our own card design so I have some leeway in which chipset I use on our own cards. But for the TK1, we are stuck with what comes with the card.
Tegra 3 multi-core has the same behavior for IRQ on CPU0 as does Tegra 4 and Tegra K1 (probably also Tegra X1, but who knows what changes 64-bit brought…X1 is the first generation to not be ARMv7). I’ve experimented with bridges on both Tegra 3 and K1; K1 is just faster and better behaved, but Tegra 3 is not actually all that bad in comparison as a bridge. Whatever changes help for K1 would also help for 3.
The part I would really like to see is some sort of profiling to determine whether the latency comes from slow execution of an IRQ handler or from waiting for the handler to start. Scheduling probably has room for tweaking; if start of execution is what's lacking, then placing some of the work in high-priority threads won't gain much, while better scheduling would.
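For exactly this "waiting versus executing" question, ftrace's irqsoff tracer can record the longest stretch interrupts were kept disabled and who was responsible. A sketch, assuming the kernel was built with CONFIG_IRQSOFF_TRACER and debugfs is mounted (both are assumptions about your kernel config):

```shell
#!/bin/sh
# Sketch using ftrace's irqsoff tracer, which records the longest
# stretch the kernel kept interrupts disabled, i.e. the worst case
# an IRQ spent waiting to start rather than executing slowly.
t=/sys/kernel/debug/tracing
if [ -w "$t/current_tracer" ]; then
    echo irqsoff > "$t/current_tracer"
    echo 0 > "$t/tracing_max_latency"   # reset the high-water mark
    sleep 10                            # generate load here, e.g. ping -f
    cat "$t/tracing_max_latency"        # worst-case latency (microseconds)
    head -40 "$t/trace"                 # shows who held interrupts off
else
    echo "irqsoff tracer unavailable (debugfs or tracer not built in)" >&2
fi
```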