TK1 kernel oops : Linux4Tegra 21.5 : freeze under high IRQ load / IRQ affinity

Hello,

We are encountering a major issue on the Jetson TK1, occurring after 12 hours of endurance testing:

  • VI V4L2 freeze from CSI, due to syncpt timeout. V4L2 image grabbing never recovers.
  • Kernel freeze with many oops messages mentioning softirq.

Our application is quite IRQ intensive and has many critical IRQ sources, since we are using:

  • udc_tegra USB OTG device mode: HID (interrupt), audio (isochronous), video (bulk) (burst: 6000 irq/s)
  • Syncpt (dependency of the VI V4L2 tegra_camera driver) (flat rate: 200 irq/s)
  • ETH0: internal Realtek RTL8111/8168/8411 (unpredictable rate, bursts seen at 2000 irq/s)
  • ETH1: mini-PCIe Realtek RTL8111/8168/8411 (unpredictable rate, bursts seen at 2000 irq/s)

We are trying to understand what is causing these kernel crashes.

On CPU0 we have NO high-priority (SCHED_FIFO) userland threads, only processes with default scheduling.
On CPU1-2-3 we have SEVERAL high-priority (SCHED_FIFO/RT) userland threads (30 ms max task execution time).

When we move the GIC IRQ affinity of udc_tegra or syncpt to CPU 1-2-3, in order to spread IRQ handling:

  • udc_core: high latency on USB packet transmit/receive (> 20 ms), occurring immediately.
  • syncpt: syncpt timeout, occurring within 1 or 2 hours.
    => This is worse, and definitely not a way to solve the issue.
    How do you explain this behaviour for these GIC IRQs?
    Some threads on this forum say that all IRQs are handled by CPU0; is that true?
    I found this piece of kernel code in drivers/irqchip/irq-tegra.c:
	/* Set affinity for all interrupts to CPU0 */
	cpumask = 0x1;			/* CPU0 bit for the first interrupt field */
	cpumask |= cpumask << 4;	/* 0x11 */
	cpumask |= cpumask << 8;	/* 0x1111 */
	cpumask |= cpumask << 16;	/* 0x11111111: one 4-bit CPU mask per IRQ, every field pointing at CPU0 */
	for (i = 0; i < (MAX_ICTLRS * ICTLR_IRQS_PER_LIC / 8); i++)
		ictlr_target_cpu[i] = cpumask;

Why are the IRQs disabled on CPU 1-2-3 by default at boot time?

When we move the PCIe-MSI IRQ affinity of ETH0 + ETH1 to CPU 1-2-3, in order to spread IRQ handling:

  • eth0/eth1 : no sign of high latency on Ethernet ping.
    => This is not what we expected.
    What is the difference in CPU affinity behaviour between GIC and PCIe-MSI IRQs?

How could we solve this issue? It seems we have IRQ bursts causing kernel oops. What is your reference tool for this kind of kernel troubleshooting? Any JTAG probe software/hardware provider to recommend?

Why not use an FIQ for syncpt, since it must have real-time behaviour, given the CSI engine's inability to recover after a syncpt timeout?

Hi Romary
Does this 12-hour run include a lot of test combinations? What is the result without VI, or when running only the VI part for 12 hours?

For our application, this is nominal usage.
It is meaningless for us to run without VI; nothing can be done without V4L2 image grabbing.

All hardware IRQs are initially handled on CPU0. There is no wiring to reach the remaining CPUs for GPIO and many other hardware functions. A good hardware driver will divide itself into a minimal part for dealing with hardware I/O, and then run the software-only parts as a separate function (which can migrate off CPU0). So for example the USB driver portion related to I/O will always start on CPU0, but the remaining functions (such as the USB Video Class driver) can transfer to other cores; a sketch of that split follows.
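
As an illustration (not the actual tegra-udc code), here is a minimal sketch of that split using request_threaded_irq(): the hard handler only acknowledges the hardware, and the heavy work runs in a kernel thread that the scheduler is free to place on another core. The device structure, register offset and names are hypothetical.

#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/module.h>

/* Hypothetical device context, for illustration only. */
struct my_dev {
	void __iomem *regs;
};

/* Hard IRQ handler: runs in interrupt context (on CPU0 for Tegra K1 GIC/LIC
 * interrupts). Do the bare minimum: acknowledge the hardware and defer. */
static irqreturn_t my_hardirq(int irq, void *data)
{
	struct my_dev *dev = data;

	writel(1, dev->regs + 0x04);	/* hypothetical "clear interrupt" register */

	return IRQ_WAKE_THREAD;		/* hand off to the threaded handler */
}

/* Threaded handler: runs in a kernel thread (process context), so it can
 * sleep, copy large buffers, etc., and may be migrated to another CPU. */
static irqreturn_t my_thread_fn(int irq, void *data)
{
	/* heavy per-packet/per-frame work goes here */
	return IRQ_HANDLED;
}

static int my_request_irq(struct my_dev *dev, int irq)
{
	return request_threaded_irq(irq, my_hardirq, my_thread_fn,
				    IRQF_ONESHOT, "my_dev", dev);
}

The threaded handler shows up as its own kernel thread (irq/<nr>-my_dev), so its CPU placement follows normal scheduler/affinity rules even though the hard handler does not.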

On a desktop PC there is hardware to support distributing hardware IRQs to other cores. On ARM this is not the case (which is why it’s hard to cause IRQ starvation on a multi-core desktop PC, but easier here).

Does this limitation (all hardware IRQs are initially handled on CPU0) affect all kinds of IRQs: GIC? PCIe-MSI? GPIO?

I don’t know if this is a technically accurate description, but basically any hardware physically wired for the CPU to interact with only has a wire path to CPU0, without any ability for other cores to talk to the device. In some cases this might be indirect; for example, you may talk to the memory controller through CPU0 to do some operations, or you could talk to the USB controller hardware to reach something plugged into USB…whereas if the device were plugged into a serial port you’d be talking to the serial hardware to reach the device. Being able to talk to hardware on other cores would require a bus architecture for distributing IRQs…Intel desktop CPUs have an I/O APIC for this purpose, but no such device exists in the ARM world.

Data which has been received from a hardware IRQ can be put in memory or sent to some other address such that other cores can see it. Obviously the memory controller is one piece of hardware all cores can reach (the hardware was designed to give all cores memory access). The author of a driver must expect much external hardware (and I mean this as logically external, e.g., through PCIe, GPIO, or USB, regardless of whether the hardware is physically integrated within the chip) to be limited to talking to CPU0. It isn’t mandatory for work beyond I/O to be migrated into another context (presumably a software interrupt or other kernel thread), but it is good taste and good driver design to do as little as possible while holding CPU0, then release CPU0 and let other cores do the rest of the work (ksoftirqd exists for exactly this kind of deferral).

You may be interested in knowing more about ksoftirqd. Here’s the man page:
[url]http://www.ms.sapientia.ro/~lszabo/unix_linux_hejprogramozas/man_en/htmlman9/ksoftirqd.9.html[/url]

And here is an article on software IRQs:
[url]https://lwn.net/Articles/520076/[/url]

Note that software IRQs can be one-shot, or they can exist as a thread which stays around waiting for future use. This might be one place where a piece of hardware under constant load could be made a bit more efficient (at the cost of using more memory part of the time) if it is expected to get a lot of interrupts: there is more overhead in creating and destroying thousands of threads per second than in keeping one thread around which can be brought into context when work is available. The nice thing about badly behaving software IRQs is that there are several CPU cores, and despite a performance hit when one of those cores is locked up by a badly behaving driver, the other cores can still handle some load. CPU0 is its own world though, so far as most hardware is concerned, and does not have that luxury. A small sketch of the one-shot style of deferral follows.
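
To illustrate the one-shot style on the 3.10-era kernel used by L4T 21.x, here is a hedged sketch using a tasklet: it runs in softirq context shortly after the hard IRQ returns, and falls back to the per-CPU ksoftirqd thread when the softirq load gets too high. All names and the stats structure are hypothetical.

#include <linux/interrupt.h>

/* Hypothetical per-device bookkeeping. */
struct my_stats {
	unsigned long frames;
};

static struct my_stats stats;

/* Tasklet body: softirq context, so it must not sleep, but it no longer
 * extends the time spent with the hardware interrupt line held. */
static void my_tasklet_fn(unsigned long data)
{
	struct my_stats *s = (struct my_stats *)data;

	s->frames++;
}

DECLARE_TASKLET(my_tasklet, my_tasklet_fn, (unsigned long)&stats);

/* Hard IRQ handler: acknowledge the device, then schedule the tasklet. */
static irqreturn_t my_irq(int irq, void *dev_id)
{
	/* ...acknowledge/clear the hardware interrupt here... */
	tasklet_schedule(&my_tasklet);
	return IRQ_HANDLED;
}

If the deferred work needs to sleep or copy large buffers, a workqueue (process context) is the usual alternative.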

There are actually more cores in the Tegra chips than the 4 ARM cores (plus the Denver cores of the TX2); those are just the general-purpose cores you have access to. There is for example an ARM Cortex-A9 core for audio processing…I’m sure this core has its own hard wiring to some I/O, but if audio data is created or consumed it must eventually end up going to the outside world (e.g., USB headphones), which in turn means work consumed or produced on that core could still require CPU0 at some point.

Any time a driver is stopped from running at a time when it is needed (due to lack of core availability) you can call it IRQ starvation. Hardware IRQ starvation and software IRQ starvation probably need to be considered separately on ARM, but might not be much different on a desktop x86_64 PC.

Hello,
Thanks for your reply.

About IRQs on CPU0: why, when we put all PCI IRQs on CPU2 using affinity, does it seem to work?

echo 4 > /proc/irq/131/smp_affinity

We know that an IRQ handler must be as small a piece of code as possible. But until now, we had no idea of the CPU consumption of each IRQ routine.

We made a small script to analyse the output of

perf script

based on perf event recording:

./perf record -a -g -e 'irq:irq_handler_entry' -e 'irq:irq_handler_exit'  sleep 2

Here is the result for our application; times are in milliseconds:

GLOBAL STAT 14 IRQs DURATION=  2003.268(ms) IRQ COUNT=     37192(i) RATE= 18565(i/s)
======================== CPU 0 COUNT=     21441(i) RATE= 10703(i/s) %CPU=19.470% ===========================
   CPU    IRQ             NAME        NBH RATE H TOTAL TIME       %CPU   MIN TIME   AVG TIME   MAX TIME
     0     85    tegra12-i2c.4          3      1      0.026      0.001      0.008      0.009      0.009
     0     52        tegra-otg      10327   5155     29.721      1.484      0.001      0.003      0.007
     0     29       arch_timer        482    240      3.632      0.181      0.004      0.008      0.022
     0     97      host_syncpt        240    119      1.762      0.088      0.004      0.007      0.011
     0     77        tegra_mon         52     25      0.205      0.010      0.003      0.004      0.005
     0     77     tegra_actmon          2      0      0.008      0.000      0.003      0.004      0.005
     0     63             mmc0         10      4      0.063      0.003      0.003      0.006      0.018
     0     52        tegra-udc      10325   5154    354.623     17.702      0.006      0.034      0.151
======================== CPU 1 COUNT=       275(i) RATE=   137(i/s) %CPU=0.113% ===========================
   CPU    IRQ             NAME        NBH RATE H TOTAL TIME       %CPU   MIN TIME   AVG TIME   MAX TIME
     1     29       arch_timer        275    137      2.271      0.113      0.003      0.008      0.016
======================== CPU 2 COUNT=     14439(i) RATE=  7207(i/s) %CPU=4.990% ===========================
   CPU    IRQ             NAME        NBH RATE H TOTAL TIME       %CPU   MIN TIME   AVG TIME   MAX TIME
     2    643             eth1       6437   3213     21.845      1.090      0.002      0.003      0.006
     2    642             eth0        475    237      1.697      0.085      0.002      0.004      0.006
     2     29       arch_timer        701    349      4.313      0.215      0.001      0.006      0.018
     2    131         PCIe-MSI       6826   3407     72.113      3.600      0.007      0.011      0.024
======================== CPU 3 COUNT=      1037(i) RATE=   517(i/s) %CPU=0.298% ===========================
   CPU    IRQ             NAME        NBH RATE H TOTAL TIME       %CPU   MIN TIME   AVG TIME   MAX TIME
     3     29       arch_timer       1037    517      5.960      0.298      0.000      0.006      0.018

You can see that the ETH0 and ETH1 IRQs are processed by CPU2, in accordance with the affinity rule we added (echo 4 > /proc/irq/131/smp_affinity).

You said: “All hardware IRQs are initially handled on CPU0.” So what is the hidden cost of using CPU affinity?
=> Ethernet driver RTL8111: no additional latency visible (ping still below 1 ms)
=> USB gadget Tegra UDC: additional latency visible (20 ms or more to complete some USB requests)

This little event-analysis tool allowed us to understand that tegra_udc (USB Device Controller) is using much more CPU than the other IRQs: AVG/PEAK time is 34 µs/151 µs.

The IRQ rate from tegra_udc is driven by the max packet length, at least in bulk mode. In our USB gadget driver, we send a UVC (webcam) stream, and some pieces of code do memcpy. We will investigate whether some structural improvement can be made, using a workqueue or threads; a sketch of the workqueue approach is shown below.
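
For reference, a minimal sketch of what “moving the memcpy into a workqueue” could look like. This is an assumption-level illustration, not taken from the tegra_udc or UVC gadget code; the frame_job structure and function names are hypothetical.

#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/workqueue.h>

/* Hypothetical deferred copy job (not actual UVC gadget code). */
struct frame_job {
	struct work_struct work;
	void *dst;
	const void *src;
	size_t len;
};

/* Runs in a kworker thread (process context), so the memcpy no longer adds
 * to the time spent inside the tegra_udc interrupt handler on CPU0. */
static void frame_copy_fn(struct work_struct *work)
{
	struct frame_job *job = container_of(work, struct frame_job, work);

	memcpy(job->dst, job->src, job->len);
	kfree(job);
}

/* Called from the fast path: queue the copy instead of doing it inline. */
static int defer_frame_copy(void *dst, const void *src, size_t len)
{
	struct frame_job *job = kzalloc(sizeof(*job), GFP_ATOMIC);

	if (!job)
		return -ENOMEM;

	job->dst = dst;
	job->src = src;
	job->len = len;
	INIT_WORK(&job->work, frame_copy_fn);
	schedule_work(&job->work);
	return 0;
}

The trade-off is an extra allocation and a scheduling hop per buffer, and the caller has to make sure the source buffer stays valid until the work item has run.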

Since we have about 8k irq/s, generating ‘irq:irq_handler_entry’ trace events at 18k calls/s, it really makes sense for us to dispatch IRQs across several CPUs if possible.

Why some affinities work and others don’t, I couldn’t tell you; I don’t know enough about the drivers. Do understand though that affinity settings tend to be a request, and may not be honored. The starting code of a driver may run on CPU0 while a subset of the driver (for example ksoftirqd threads) splits off to another core…the different driver parts might have originated from the same driver and show up under the same name…part starts on CPU0, then a child migrates to this other core. Where a driver starts and where it finishes may not be the same location. You cannot start hardware access on a core which is not wired to touch that hardware.

On a multi-core Intel-format desktop PC with the I/O APIC (which schedules and distributes hardware IRQs) disabled, CPU0 requests will occur with slightly lower latency; but as the IRQ load goes up you will approach IRQ starvation much sooner than if the I/O APIC is enabled. In cases where the I/O APIC is always on and PC hardware IRQs can go to any core, there is a slight latency for the APIC to route to the scheduled core (it isn’t much). With the APIC on it is very hard to get two hardware IRQs to compete to the point of IRQ starvation, even on a dual-core system. There is no I/O APIC on the ARM cores, but the scheduler still splits off some of the driver to another core, e.g., ksoftirqd migration. Some latency is involved in transferring to another core, but the statistic you won’t normally see (though it is very important) is that CPU0 may have delayed another driver had it not done this…you see the latency to migrate, you don’t have a measure of the latency saved for the next driver wanting CPU0. I’ve never done it, but I don’t see why a driver could not split off two threads if the software half of the driver work fits the model.

There are two things you will always have trouble controlling. The first is whether you get a cache hit or miss…the more a thread migrates between cores, the more often you’ll get a cache miss…but other drivers and other code mean that even if you run it all on one core you might still get some unpredictable cache misses. Sticking to one non-CPU0 core for a certain data set may be beneficial even if another core is ready to serve (see the sketch below). The second thing you can’t control is that you didn’t write all of the hardware drivers…it’s very difficult to predict what someone else’s driver will do.
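
As a small, hedged illustration of “sticking to one non-CPU0 core for a certain data set”, deferred work can be queued to a fixed core with queue_work_on() so that repeated runs tend to find their data still warm in that core’s cache. The work item and its body are hypothetical.

#include <linux/workqueue.h>

/* Hypothetical deferred job that should always run on the same core. */
static void stats_work_fn(struct work_struct *work)
{
	/* ...process a per-device data set that benefits from cache locality... */
}

static DECLARE_WORK(stats_work, stats_work_fn);

static void kick_stats_work(void)
{
	/* Always queue on CPU2 instead of letting the scheduler pick a core,
	 * so successive runs tend to reuse the same L1/L2 cache contents. */
	queue_work_on(2, system_wq, &stats_work);
}

In practice the CPU number would come from configuration rather than being hard-coded.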

Romary
It would be better to attach the log with the sync point error, along with detailed information on how your application runs and its pipeline.

Hi Romary

Could you modify your app to grab frame data from videotestsrc or somewhere else, to verify whether the VI timeout causes the problem?

Hi Romary

Can you disable all cpuidle and cpuquiet states and try again?

for c in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do echo 1 > $c; done

Hi,

We launched a test over the last weekend, but it failed.

Here are the cpuidle/cpuquiet states during this test:

/sys/devices/system/cpu/cpu0/cpuidle/state0/disabled : 0
/sys/devices/system/cpu/cpu0/cpuidle/state1/disabled : 0
/sys/devices/system/cpu/cpu0/cpuidle/state2/disabled : 1
/sys/devices/system/cpu/cpu1/cpuidle/state0/disabled : 0
/sys/devices/system/cpu/cpu1/cpuidle/state1/disabled : 0
/sys/devices/system/cpu/cpu2/cpuidle/state0/disabled : 0
/sys/devices/system/cpu/cpu2/cpuidle/state1/disabled : 0
/sys/devices/system/cpu/cpu3/cpuidle/state0/disabled : 0
/sys/devices/system/cpu/cpu3/cpuidle/state1/disabled : 0

Hi Romary
You should set all of them to 1 to disable cpuidle. And could you also check cpuquiet/active, which should be set to 0?

Hi,

What is the meaning of these cpuidle states (so that we know the impact of these modifications)?

CPU idle is a mechanism for staying in a low-power state when the CPU is not busy. Each idle state has a different power consumption, entry/exit cost and minimum duration.

Hi Romary,

Have you clarified and resolved the problem?
Is there any information you can share?

Thanks