This might be related, and will help explain at least part of it…
Whenever hardware (or software) needs a kernel driver to run, an interrupt occurs. The scheduler sees this and, based on scheduling rules, picks when to run the driver. To some extent, the scheduler is also responsible for choosing which CPU core to use (but that’s not the whole story).
It is important to understand that there are two types of interrupts (IRQs). One is a software IRQ; these do not talk to hardware at physical addresses, they are purely software. Calculating a checksum in software, for example, could be handled as a soft IRQ. The part of the kernel which schedules and distributes software IRQs is ksoftirqd.
Note that a soft IRQ might pick any CPU core, but because the scheduler understands cache hits (versus cache misses), it will tend to apply pressure to keep a given PID (or thread ID) on the same core (switching cores essentially guarantees a performance loss through cache misses).
When we are dealing with hardware drivers (a hardware IRQ), actual physical addresses must be used. In addition, an actual wire has to exist between the CPU core and the hardware. On an AMD64-style desktop PC there is an I/O APIC (Advanced Programmable Interrupt Controller); the APIC can be reprogrammed to route a hardware IRQ to any core, and it has the wiring to do so. A hardware driver will also need atomic code sections more often than a soft IRQ does. For example, if you want to pause an I/O transfer from one part of memory to another, you can easily find ways to pause it and restart it later; the same sort of I/O request between memory and a hard drive may not be pausable, because the hard drive might not be able to stop in the middle of a memory request the way a pure software program can.
It is considered good practice to take any hardware driver and divide the operations which must be performed with the hardware from related operations which can be deferred to software. For example, the part of an ethernet driver which services the hardware buffer when the hardware IRQ triggers really must run atomically; however, if the checksum for the traffic is not computed in hardware, then instead of putting the checksum code in the hardware driver, a separate software driver would be implemented, and the hardware driver would trigger a software IRQ for it. If the checksum is in hardware, then perhaps this would instead be two separate drivers, each triggered by the hardware IRQ wire (each half could theoretically then run on a separate core).
I believe this is still the case with Orins: much of the hardware has wiring only to CPU0 (labeled CPU1 in top/jtop/htop). Every CPU core of course has its own timers and memory controller access, and perhaps groups of GPIO could be moved to some other core, but for the most part hardware interrupts only have wiring to CPU0. Take a look at the output of:
cat /proc/interrupts
Those are hardware IRQs. Not everything running on CPU0 must be on CPU0 (the scheduler might not know to offload some software to other cores), but much of it must be on CPU0 due to lack of hardware IRQ wiring (I hope someone from NVIDIA might be able to say which hardware can go to any core).
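As a concrete illustration, each IRQ in that listing also exposes an affinity bitmask under /proc/irq/. The IRQ number 30 below is only a placeholder for this sketch; pick one from your own listing:

```shell
# Per-CPU service counts since boot; the first column is the IRQ number,
# and each CPUn column shows how many times that core serviced it.
head -n 10 /proc/interrupts

# Each IRQ has an affinity bitmask: "1" means only CPU0 may service it,
# "3" means CPU0 or CPU1, and so on. "30" is just an example number.
cat /proc/irq/30/smp_affinity 2>/dev/null || echo "IRQ 30 not present here"
```

On IRQs whose hardware is wired only to CPU0, writing a wider mask to smp_affinity either fails or has no lasting effect.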
For the purpose of efficiency, if you know of a software process running on CPU0, you might want to set its affinity to another core. If the software then runs entirely on that one core, there shouldn’t be any performance hit from cache misses (except perhaps the first time the process runs, or if other processes are competing for the cache).
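A minimal sketch of setting process affinity from the command line with `taskset`; the core numbers and the PID here are just examples:

```shell
# Report the affinity mask of the current shell.
taskset -cp $$

# Launch a command restricted to a given core list (core 0 here only so
# the example runs on any machine; in practice you would pick a core
# other than CPU0 in order to unload CPU0).
taskset -c 0 sleep 0.1

# Re-pin an already-running process (1234 is a hypothetical PID):
#   taskset -cp 1 1234
```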
Note that if you set a hardware IRQ to a core which cannot service it due to wiring, the scheduler will still try to send it there. You might briefly see the IRQ listed on that core, but then you’ll see it back on CPU0, because the scheduler eventually recognizes that the other core isn’t available.
Each hardware IRQ takes a certain amount of time to service. As CPU0 load goes up, and as more hardware IRQs hit that core, the CPU will eventually not have time to service all of the IRQs. This is known as “interrupt starvation”. About the only solution is to move all of the software-only work onto a different core. This might include rewriting an inefficient hardware IRQ handler and splitting it into a mix of hardware IRQ and software IRQ, with the software IRQ going to another core (which can be inefficient as far as cache goes, but it is better than completely losing a hardware IRQ).
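For the networking case specifically, one standard way to push the software half (protocol and checksum processing) off the core that takes the hardware IRQ is Receive Packet Steering (RPS). This is only a sketch: the interface name is an assumption, and writing the mask requires root:

```shell
# Show the current RPS mask for each receive queue of each interface.
# A mask of "0"/"00" means the softirq work stays on whichever core
# took the hardware IRQ (typically CPU0 on a Jetson).
grep . /sys/class/net/*/queues/rx-*/rps_cpus 2>/dev/null \
    || echo "no RPS files visible here"

# To steer that work to CPUs 1-3 (hex mask "e" = binary 1110), you
# would, as root, write the mask ("eth0" is an assumed interface name):
#   echo e > /sys/class/net/eth0/queues/rx-0/rps_cpus
```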
I don’t think the hardware IRQ for ethernet can be moved to a different core. Someone from NVIDIA would have to say whether that is possible (it might be now, but it wasn’t the case on earlier Jetsons).