We are encountering some major issues on the Jetson TK1, occurring after 12 hours of endurance testing:
- VI V4L2 capture from CSI freezes due to a syncpt timeout. V4L2 image grabbing never recovers.
- Kernel freeze with lots of oopses mentioning softirq.
Our application is quite IRQ intensive and has many critical IRQ sources, since we are using:
- udc_tegra USB OTG in device mode: HID (interrupt), Audio (isochronous), Video (bulk) (burst: 6000 irq/s)
- Syncpt (dependency of the VI V4L2 tegra_camera driver) (flat rate: 200 irq/s)
- ETH0 internal Realtek RTL8111/8168/8411 (unpredictable rate, bursts seen at 2000 irq/s)
- ETH1 mPCIe Realtek RTL8111/8168/8411 (unpredictable rate, bursts seen at 2000 irq/s)
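For reference, rates like the ones above can be obtained by sampling /proc/interrupts twice and dividing the counter delta by the sampling interval. Below is a minimal sketch of the per-line parsing, assuming the usual layout (IRQ label, one count column per online CPU, then the controller and device names). `sum_irq_counts` and `irq_rate` are hypothetical helper names, not part of any library.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: sum the per-CPU counters on one /proc/interrupts line.
 * Assumes the layout "<label>: <count cpu0> <count cpu1> ... <chip> <name>".
 * Returns 0 and stores the total in *total, or -1 if the line has no
 * "label:" prefix.
 */
int sum_irq_counts(const char *line, int ncpus, unsigned long long *total)
{
    const char *p = strchr(line, ':');
    if (p == NULL)
        return -1;
    p++;  /* skip the colon */

    unsigned long long sum = 0;
    for (int cpu = 0; cpu < ncpus; cpu++) {
        while (*p == ' ' || *p == '\t')
            p++;
        if (!isdigit((unsigned char)*p))
            break;  /* fewer count columns than expected (offline CPUs) */
        char *end;
        sum += strtoull(p, &end, 10);
        p = end;
    }
    *total = sum;
    return 0;
}

/* Rate in IRQ/s between two samples taken interval_s seconds apart. */
double irq_rate(unsigned long long before, unsigned long long after,
                double interval_s)
{
    return (double)(after - before) / interval_s;
}
```

Summing over all CPUs gives the total rate for one source; keeping the columns separate instead would show which CPU is actually servicing the IRQ.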
We are trying to understand what is causing these kernel crashes.
On CPU0 we have NO high-priority (SCHED_FIFO) userland threads, only processes with default scheduling.
On CPU1-2-3 we have SEVERAL high-priority (SCHED_FIFO/RT) userland threads (30 ms max task execution time).
When we move the GIC IRQ affinity of udc_tegra or syncpt to CPU 1-2-3, in order to distribute IRQ handling:
- udc_core: we get high latency on USB packet transmit/receive (> 20 ms), occurring immediately.
- syncpt: syncpt timeout, occurring within 1 or 2 hours.
=> This is worse, and definitely not a way to solve the issue.
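For reference, an affinity change like the one described above is normally done by writing a hexadecimal CPU bitmask to /proc/irq/<irq>/smp_affinity. Below is a minimal sketch of that operation in C; `cpus_to_mask` and `set_irq_affinity` are hypothetical helper names, and the IRQ number has to be looked up in /proc/interrupts on the target.

```c
#include <stdio.h>

/* Build a bitmask from a list of CPU indexes (each < 32), e.g. {1, 2, 3} -> 0xe. */
unsigned int cpus_to_mask(const int *cpus, int count)
{
    unsigned int mask = 0;
    for (int i = 0; i < count; i++)
        mask |= 1u << cpus[i];
    return mask;
}

/*
 * Write the mask as hex to /proc/irq/<irq>/smp_affinity (needs root).
 * Returns 0 on success, -1 if the file cannot be opened or the write
 * is rejected.
 */
int set_irq_affinity(int irq, unsigned int mask)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fprintf(f, "%x\n", mask) > 0;
    if (fclose(f) != 0)  /* stdio buffers the write, so errors can surface here */
        ok = 0;
    return ok ? 0 : -1;
}
```

Reading the file back after the write is a cheap way to check whether the new mask was actually accepted.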
How can this behaviour of the GIC IRQs be explained?
Some threads on this forum say that all IRQs are handled by CPU0. Is that true?
I found this piece of kernel code in drivers/irqchip/irq-tegra.c:
    /* Set affinity for all interrupts to CPU0 */
    cpumask = 0x1;
    cpumask |= cpumask << 4;
    cpumask |= cpumask << 8;
    cpumask |= cpumask << 16;
    for (i = 0; i < (MAX_ICTLRS * ICTLR_IRQS_PER_LIC / 8); i++)
        ictlr_target_cpu[i] = cpumask;
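Working through the arithmetic in that snippet: the three shift-and-OR steps copy the low nibble into every nibble of the 32-bit word, so cpumask ends up as 0x11111111. Together with the division by 8 in the loop bound, this suggests that each word holds eight interrupts with a 4-bit CPU mask each, bit 0 being CPU0. That reading is an assumption drawn from the code itself, not something confirmed by documentation. A small userspace sketch reproducing the computation (`replicate_nibble` is a hypothetical name):

```c
#include <stdint.h>

/*
 * Replicate a 4-bit CPU mask into all eight nibbles of a 32-bit word,
 * using the same shift-and-OR steps as the kernel snippet.
 * nibble = 0x1 (CPU0 only) gives 0x11111111.
 */
uint32_t replicate_nibble(uint32_t nibble)
{
    uint32_t mask = nibble & 0xf;
    mask |= mask << 4;   /* 2 nibbles filled */
    mask |= mask << 8;   /* 4 nibbles filled */
    mask |= mask << 16;  /* all 8 nibbles filled */
    return mask;
}
```

Under the same assumption, a mask allowing CPU 1-2-3 for every interrupt would be replicate_nibble(0xe), i.e. 0xeeeeeeee.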
Why are IRQs disabled on CPU 1-2-3 by default at boot time?
When we move the PCIe-MSI IRQ affinity of ETH0 + ETH1 to CPU 1-2-3, in order to distribute IRQ handling:
- eth0/eth1: no sign of high latency on Ethernet ping.
=> This is not what we expected.
What is the difference in CPU affinity behaviour between GIC and PCIe-MSI IRQs?
How could we solve this issue? It seems that IRQ bursts are causing the kernel oopses. What is your reference tool for this kind of kernel troubleshooting? Can you recommend a JTAG probe software/hardware provider?
Why not use an FIQ for syncpt, since it needs real-time behaviour because the CSI engine is unable to recover after a syncpt timeout?