Jetson TX2 - CPU Core 0 Stuck at 100% Usage and System Crash Issues

Hello,

I have set up a recording device using a Jetson TX2 board connected to an Elroy capture card, along with both analog and digital cameras. Additionally, I am using a Peak System PCAN card to capture CAN data. This setup allows me to record CAN data from different computers while simultaneously recording camera footage.

However, I am experiencing an issue where CPU core 0 is consistently being utilized significantly more than the other cores, often reaching 100%. This high utilization on core 0 disrupts the running processes and sometimes even forces the device to reset.

I am wondering what might be causing core 0 to get stuck at 100% while the other cores remain less loaded (they’re around 1-2%, and the Denver CPUs are around 40%). I didn’t set any CPU affinity in any of my processes.

I would greatly appreciate any suggestions or insights that could help me resolve this issue.

EDITED:

I think I found something related to this problem. I am seeing lots of interrupts from all processes, and they are all sent directly to core 0. I’m currently trying to find a way to distribute IRQs evenly across all cores. I changed /proc/irq/xx/smp_affinity, but the IRQ still ends up on a single core, and after a reboot it is set to core 0 as before.

Hi gobildolma,

Are you using the devkit or custom board for TX2?
What’s the Jetpack version in use?

You can run the top command to check which process is occupying the CPU.

Please also share the output of cat /proc/interrupts so we can check further.

$ cat /etc/nv_tegra_release
R32 (release), REVISION: 4.4, GCID: 23942405, BOARD: t186ref, EABI: aarch64, DATE: Fri Oct 16 …

I’m using a custom board.

When I run cat /proc/interrupts, I see:

$ cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       
  1:          0          0          0          0          0          0     GICv2  29 Level     trusty
  2:          0          0          0          0          0          0     GICv2  30 Level     arch_timer
  5:    1363246          0          0          0          0          0     GICv2  32 Level     tegra186_timer0
  6:          0    2681644          0          0          0          0     GICv2  33 Level     tegra186_timer1
  7:          0          0    2837747          0          0          0     GICv2  34 Level     tegra186_timer2
  8:          0          0          0    1276338          0          0     GICv2  35 Level     tegra186_timer3
  9:          0          0          0          0    1453316          0     GICv2  36 Level     tegra186_timer4
 10:          0          0          0          0          0    1039170     GICv2  37 Level     tegra186_timer5
 11:    4632953          0          0          0          0          0     GICv2 208 Level     hsp
 12:          0          0          0          0          0          0     GICv2 202 Level     arm-smmu global fault
 13:          0          0          0          0          0          0     GICv2 203 Level     arm-smmu global fault
 21:     102473          0          0          0          0          0     GICv2  97 Level     mmc0
 22:    2325974          0          0          0          0          0     GICv2  96 Level     mmc1
 23:          0          0          0          0          0          0     GICv2  94 Level     mmc2
 24:          0          0          0          0          0          0     GICv2 229 Level     tegra-ahci[3507000.ahci-sata]
 25:        207          0          0          0          0          0     GICv2  57 Level     3160000.i2c
 26:          2          0          0          0          0          0     GICv2  58 Level     c240000.i2c
 27:        288          0          0          0          0          0     GICv2  59 Level     3180000.i2c
 28:         46          0          0          0          0          0     GICv2  60 Level     3190000.i2c
 29:          0          0          0          0          0          0     GICv2  62 Level     31b0000.i2c
 30:          0          0          0          0          0          0     GICv2  63 Level     31c0000.i2c
 31:         59          0          0          0          0          0     GICv2  64 Level     c250000.i2c
 32:          0          0          0          0          0          0     GICv2  65 Level     31e0000.i2c
 33:          0          0          0          0          0          0     GICv2  68 Level     3210000.spi
 34:          0          0          0          0          0          0     GICv2  69 Level     c260000.spi
 35:          0          0          0          0          0          0     GICv2  71 Level     3240000.spi
 41:          0          0          0          0          0          0     GICv2 226 Level     ether_qos.common_irq
 43:          0          0          0          0          0          0     GICv2 222 Level     2490000.ether_qos.rx0
 44:          0          0          0          0          0          0     GICv2 218 Level     2490000.ether_qos.tx0
 51:          0          0          0          0          0          0     GICv2  48 Level     b000000.rtcpu
 52:       3846          0          0          0          0          0     GICv2 242 Level     d230000.actmon
 53:     641927          0          0          0          0          0     GICv2 297 Level     host_syncpt
 54:          2          0          0          0          0          0     GICv2 295 Level     host_status
 56:          0          0          0          0          0          0     GICv2 233 Level     15700000.vi
 59:          0          0          0          0          0          0     GICv2 237 Level     tegra-isp-isr
 60:    1006388          0          0          0          0          0     GICv2 186 Level     15210000.nvdisplay
 61:          0          0          0          0          0          0     GICv2 238 Level     vic
 65:          0          0          0          0          0          0        PM  42 Level     tegra_rtc
 66:          0          0          0          0          0          0     GICv2 255 Level     mc_status
 70:      97708          0          0          0          0          0        PM 195 Level     xhci-hcd:usb1
 71:          1          0          0          0          0          0        PM 196 Level     3530000.xhci
 72:          0          0          0          0          0          0        PM 199 Level     3530000.xhci
 73:          0          0          0          0          0          0     GICv2 198 Level     3550000.xudc
 74:     201810          0          0          0          0          0     GICv2 102 Level     gk20a_stall
 75:          0          0          0          0          0          0     GICv2 103 Level     gk20a_nonstall
 77:          0          0          0          0          0          0     GICv2 315 Level     3ad0000.se_elp
 78:         31          0          0          0          0          0     GICv2 173 Level     b150000.tegra-hsp, b150000.tegra-hsp, b150000.tegra-hsp
 82:          0          0          0          0          0          0     GICv2 165 Level     c150000.tegra-hsp
 93:          0          0          0          0          0          0     GICv2 107 Level     gpcdma.0
 94:          0          0          0          0          0          0     GICv2 108 Level     gpcdma.1
 95:          0          0          0          0          0          0     GICv2 109 Level     gpcdma.2
 96:          0          0          0          0          0          0     GICv2 110 Level     gpcdma.3
 97:          0          0          0          0          0          0     GICv2 111 Level     gpcdma.4
 98:          0          0          0          0          0          0     GICv2 112 Level     gpcdma.5
 99:          0          0          0          0          0          0     GICv2 113 Level     gpcdma.6
100:          0          0          0          0          0          0     GICv2 114 Level     gpcdma.7
101:          0          0          0          0          0          0     GICv2 115 Level     gpcdma.8
102:          0          0          0          0          0          0     GICv2 116 Level     gpcdma.9
103:         46          0          0          0          0          0     GICv2 117 Level     gpcdma.10
104:         46          0          0          0          0          0     GICv2 118 Level     gpcdma.11
105:          0          0          0          0          0          0     GICv2 119 Level     gpcdma.12
106:          0          0          0          0          0          0     GICv2 120 Level     gpcdma.13
107:          0          0          0          0          0          0     GICv2 121 Level     gpcdma.14
108:          0          0          0          0          0          0     GICv2 122 Level     gpcdma.15
109:          0          0          0          0          0          0     GICv2 123 Level     gpcdma.16
110:          0          0          0          0          0          0     GICv2 124 Level     gpcdma.17
111:          0          0          0          0          0          0     GICv2 125 Level     gpcdma.18
112:          0          0          0          0          0          0     GICv2 126 Level     gpcdma.19
113:          0          0          0          0          0          0     GICv2 127 Level     gpcdma.20
114:          0          0          0          0          0          0     GICv2 128 Level     gpcdma.21
115:          0          0          0          0          0          0     GICv2 129 Level     gpcdma.22
116:          0          0          0          0          0          0     GICv2 130 Level     gpcdma.23
117:          0          0          0          0          0          0     GICv2 131 Level     gpcdma.24
118:          0          0          0          0          0          0     GICv2 132 Level     gpcdma.25
119:          0          0          0          0          0          0     GICv2 133 Level     gpcdma.26
120:          0          0          0          0          0          0     GICv2 134 Level     gpcdma.27
121:          0          0          0          0          0          0     GICv2 135 Level     gpcdma.28
122:          0          0          0          0          0          0     GICv2 136 Level     gpcdma.29
123:          0          0          0          0          0          0     GICv2 137 Level     gpcdma.30
124:          0          0          0          0          0          0     GICv2 138 Level     gpcdma.31
233:          0          0          0          0          0          0  tegra-gpio 101 Level     phy_interrupt
253:         13          0          0          0          0          0  tegra-gpio 121 Edge      15210000.nvdisplay
257:          0          0          0          0          0          0  tegra-gpio 125 Edge      3400000.sdhci cd
291:          0          0          0          0          0          0  tegra-gpio 159 Edge      external-connection:extcon@1
333:          0          0          0          0          0          0  tegra-gpio-aon  16 Level     tmp451
373:          0          0          0          0          0          0  tegra-gpio-aon  56 Edge      Power
374:          0          0          0          0          0          0  tegra-gpio-aon  57 Edge      Volume Up
375:          0          0          0          0          0          0  tegra-gpio-aon  58 Edge      Volume Down
376:     320834          0          0          0          0          0  tegra-gpio-aon  59 Level     bcmsdh_sdmmc
377:          1          0          0          0          0          0  tegra-gpio-aon  60 Edge      bluetooth hostwake
381:          0          0          0          0          0          0     GICv2 104 Level     PCIE
383:       3032          0          0          0          0          0     GICv2 193 Level     snd_hda_tegra
384:        529          0          0          0          0          0     GICv2  39 Level     30c0000.watchdog
390:          0          0          0          0          0          0     GICv2 194 Level     cec_irq
391:          0          0          0          0          0          0        PM 241 Edge      max77620-top
395:          0          0          0          0          0          0  max77620-top   3 Edge      max77620-gpio
396:          0          0          0          0          0          0  max77620-top   4 Edge      max77686-rtc
400:          0          0          0          0          0          0  max77620-top   8 Edge      max77620-thermal
401:          0          0          0          0          0          0  max77620-top   9 Edge      max77620-thermal
402:          0          0          0          0          0          0  max77620-gpio   0 Edge      external-connection:extcon@1
411:       1492          0          0          0          0          0  agic-controller  32 Level   
412:        106          0          0          0          0          0  agic-controller  33 Level   
456:          0          0          0          0          0          0  max77686-rtc   1 Edge      rtc-alarm1
IPI0:   1224123    2848682    2529323    2173995    1857300    2299279       Rescheduling interrupts
IPI1:      1529      37645      37673       1151       1051        903       Function call interrupts
IPI2:         0          0          0          0          0          0       CPU stop interrupts
IPI3:         0          0          0          0          0          0       Timer broadcast interrupts
IPI4:    159342     413222     390315     143587     137725     162437       IRQ work interrupts
IPI5:         0          0          0          0          0          0       CPU wake-up interrupts

Is there any way I can balance them properly without degrading my device’s performance?
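To make the imbalance in the table above easier to read, the counts can be summed per core and sorted. The following is only a rough sketch: the function names (parse_interrupts, per_cpu_totals, top_irqs) are made up for illustration and are not an existing tool, and the sample is a handful of rows copied from the output above with only the CPU0 and CPU1 columns kept.

```python
# Sketch: summarize /proc/interrupts-style text per CPU.
# All function names here are illustrative, not part of any existing tool.

def parse_interrupts(text):
    """Return (cpu_names, rows); each row is (label, counts, description)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpu_names = lines[0].split()          # header row: CPU0 CPU1 ...
    rows = []
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        counts = fields[:len(cpu_names)]
        # Skip summary lines (e.g. "Err:") that lack one count per CPU.
        if len(counts) < len(cpu_names) or not all(c.isdigit() for c in counts):
            continue
        desc = " ".join(fields[len(cpu_names):])
        rows.append((label.strip(), [int(c) for c in counts], desc))
    return cpu_names, rows


def per_cpu_totals(rows, ncpu):
    """Sum every row's counts, column by column."""
    totals = [0] * ncpu
    for _, counts, _ in rows:
        for i, c in enumerate(counts):
            totals[i] += c
    return totals


def top_irqs(rows, cpu, n=5):
    """The n rows with the highest count on the given CPU index."""
    return sorted(rows, key=lambda r: r[1][cpu], reverse=True)[:n]


# A few rows from the output above, CPU0 and CPU1 columns only.
SAMPLE = """\
           CPU0       CPU1
  5:    1363246          0     GICv2  32 Level     tegra186_timer0
 11:    4632953          0     GICv2 208 Level     hsp
 22:    2325974          0     GICv2  96 Level     mmc1
IPI0:   1224123    2848682       Rescheduling interrupts
"""

cpus, rows = parse_interrupts(SAMPLE)
print(dict(zip(cpus, per_cpu_totals(rows, len(cpus)))))
for label, counts, desc in top_irqs(rows, cpu=0, n=2):
    print(label, counts[0], desc)
```

On the device itself, the text would come from open("/proc/interrupts").read() instead of SAMPLE. The counts are cumulative since boot, so a large number only says an IRQ has been busy at some point, not that it is busy right now.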

I will add some information on this, but it won’t answer your question (maybe it will lead to different questions).

There are two kinds of interrupt: Software and hardware. You will find hardware interrupts are listed in /proc/interrupts. Hardware interrupts require an actual wire to trigger the scheduler, while software IRQs are more “virtual”. Either way, it is the scheduler which chooses when to send a process to a core, and for software IRQs, this can always go to any core. In the case of a hardware IRQ the scheduler can only send the process to a core with a driver that can be “wired” (with copper traces on the PCB) directly or indirectly to a given core.

On a desktop PC there are different ways to do this between Intel and AMD CPUs, but the Intel one is a good example. Intel uses an “I/O Advanced Programmable Interrupt Controller” (an IO-APIC). This controller is essentially a switch which routes the wiring to the correct core. If one were to migrate some hardware (e.g., a GPU or network card) to a specific core, then when the scheduler finds the hardware IRQ, it will set up the IO-APIC to route everything needed to that specific core, and will load the proper driver or other content onto the core.

Sometimes entire groups of hardware must migrate together, e.g., the GPIO.

On Jetsons there are many hardware IRQs which can only be routed to CPU 0. This is because there is no IO-APIC, and no method to route to other cores on some hardware. If that is the case, and if someone schedules for another core, then when the time comes the scheduler will reschedule to run on CPU 0.

When one designs a hardware driver it is “best practice” to perform only the very minimal work in that driver, and to then create a different driver for “soft” IRQ functions. An example would be a network card might get data from the hardware, and then issue a soft interrupt to schedule computing a checksum. This would be preferable (if checksum is computed in CPU) to performing the entire step in a single driver for the network card since it allows context switching between hard and soft IRQs.

At some point your CPU core might reach 100%, and it won’t be due to error. Instead, it is simply due to the number of drivers which run on that CPU core in combination with the time required to service each hard IRQ. When there is no longer sufficient time to schedule everything needed, this becomes “IRQ starvation”.

One often runs everything from a given process on a single core for performance reasons. Each core has its own cache, and a cache miss is much slower than running with a cache hit. If the two interrupts (hard+soft) use the same data, then typically you get more cache hits by being on a single core. The scheduler tends to consider this the default case, and this is the reason why even if you have software IRQs which can be offloaded, they may still schedule on CPU 0. However, if there is IRQ starvation, then it might still be better to reschedule or offload a soft IRQ to another core. This other core will tend to mean you get more cache misses, but it is better than IRQ starvation.

You are correct to ask about too much load on CPU 0, but I can’t tell you which parts of your hardware can be scheduled on a different CPU core. If there is some hardware where the hardware IRQ itself can go to another core, then this would be the best situation. Quite often though, on a Jetson, you will find parts of the hardware can only reach CPU 0. If it turns out that this hardware IRQ and driver later triggers a software IRQ, then the default is to run that soft IRQ on the same core, but one could aid in preventing IRQ starvation by rescheduling the soft IRQ to a different core (at the likely expense of a cache miss).

I couldn’t tell you though which IRQs must stay on CPU 0, nor which soft IRQs are on that core which could be rescheduled. If you run top or htop, then you’ll notice ksoftirqd, the per-core kernel thread (ksoftirqd/0 for CPU 0) which services software interrupts for a given core. If you can identify a software IRQ running on CPU 0, then this would be a candidate to move to another core. If there isn’t a significant CPU 0 soft IRQ user, then you might be out of luck.
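One way to look for a heavy soft IRQ user on CPU 0 is to compare two snapshots of /proc/softirqs, since its counters only ever grow. What follows is a minimal sketch under that assumption; the function names are invented for illustration, and the BEFORE/AFTER numbers are made up purely to show the mechanics. They are not measurements from any Jetson.

```python
# Sketch: find which softirq type grew the most on one CPU between two
# snapshots of /proc/softirqs. Function names are illustrative only.
import time


def parse_softirqs(text):
    """Map softirq name -> list of per-CPU counts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        name, _, rest = ln.partition(":")
        table[name.strip()] = [int(x) for x in rest.split()[:ncpu]]
    return table


def delta(before, after):
    """Per-CPU growth (after minus before) for each softirq type."""
    return {k: [b - a for a, b in zip(before[k], after[k])]
            for k in after if k in before}


def busiest(growth, cpu):
    """Softirq names sorted by growth on the given CPU, largest first."""
    return sorted(growth, key=lambda k: growth[k][cpu], reverse=True)


def sample_device(seconds=5, path="/proc/softirqs"):
    """On the target: take two real snapshots `seconds` apart."""
    with open(path) as f:
        first = parse_softirqs(f.read())
    time.sleep(seconds)
    with open(path) as f:
        second = parse_softirqs(f.read())
    return delta(first, second)


# MADE-UP numbers, only to demonstrate the functions above.
BEFORE = """\
                    CPU0       CPU1
          HI:          0          0
       TIMER:       1000        900
      NET_RX:       5000         10
"""
AFTER = """\
                    CPU0       CPU1
          HI:          0          0
       TIMER:       1100       1000
      NET_RX:       9000         12
"""

growth = delta(parse_softirqs(BEFORE), parse_softirqs(AFTER))
print(busiest(growth, cpu=0))
```

On the board, calling sample_device() while the recording is running would give the real growth per type. A type which grows quickly on CPU 0 and hardly at all elsewhere would be the kind of candidate described above.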

I don’t know in your case what can be moved where. There might or might not be some tuning available.

First of all, I would like to sincerely thank you; I have never received this much information from my university professors. :)

I am currently exploring ways to manage IRQs more efficiently, as most of my peripherals are connected via the PCIe port, which limits my flexibility on this issue.

If you could provide additional insights on this topic, I would be eagerly awaiting them.

Much of what you need to know depends specifically on the Jetson hardware. It shouldn’t crash at high load on CPU 0, so that still needs to be solved. The high CPU 0 load, though, comes from so many interrupts going to CPU 0, and this in turn is usually because the scheduler put them there. For much of the hardware on the Jetson the scheduler has no choice but to schedule there, but there may be some which can migrate to another core. That’s something @KevinFFF might answer, but I think if CPU 0 is under high load, or if you need too much “real time” behavior, you might be unable to go further. However, solving the part which causes the system crash has a strong chance of improving the behavior of everything else running on CPU 0.

An example is that I think the GPIO complex can have the entire group moved to another core, but I’m not certain on that. If you are not using GPIO, or if it is low load, then there is no benefit in moving that to a new core.

There might also be a way to slightly manipulate priorities, but this is unlikely to stop crashes. I wouldn’t know where to look further without knowing what specifically is causing the system crash (maybe it is random due to high CPU 0 load; maybe it is one specific process being starved which causes the crash…the two would be very different problems even though they would begin as looking the same). I have no way to reproduce this.

There is an RT kernel extension, but this does not magically solve anything. What it does is to allow finer tuning of scheduling and priorities, but it still has the same limitations as before regarding something on CPU 0 requiring that core due to wiring. If you don’t know what to tune, then the RT extension wouldn’t do much other than complicating the issue. Incidentally, because of hardware, the RT kernel only provides a “soft” real time extension. To get true “hard” RT you would need ARM Cortex-R hardware or a Cortex-M (the “-R” has hard real time scheduling aid hardware; the “-M” can do hard RT, but it is very limited on the load it can handle).

Perhaps one of the biggest aids to answering this would be to experiment until you find a single process which triggers the crash: for example, if the crash never occurs with everything except one of the processes running, and always occurs once that last process is added.

You are at a big disadvantage here though because the TX2 is only in a maintenance life now, and there is no new feature development. This is not a new feature, and I suppose it could qualify for a bug fix, but most of that is reserved for security fixes. Being able to identify the exact bug or issue could change whether or not it is considered worth issuing a patch for. Without knowing the exact issue it is unlikely to get any sort of support in the maintenance part of the TX2’s life cycle.

I know you are going to hate this answer, but right now Orin is still in full development. I think it won’t be long before new hardware is presented as the next generation after Orin, but maybe (I don’t know) there will be a price drop for Orin when the next generation hits the market. Or maybe there is an educational discount (I don’t know, but there is often such a discount). Switching to Orin would be a drastic improvement in hardware performance and support.


Some interesting stories about the first true hard realtime computer, used in the Apollo moon lander; this is what happens when a hard RT core reaches its limits, versus how a “regular” computer just starts failing:
https://www.youtube.com/watch?v=B1J2RMorJXM

https://www.youtube.com/watch?v=Pfj0cUpjmug

That is expected: a change to /proc/irq/xx/smp_affinity is only valid until shutdown.

Which IRQ would you like to be handled by other CPU?

Please also share the output of the top command on your board so we can check which process is occupying the CPU.
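For reference, /proc/irq/xx/smp_affinity takes a hexadecimal CPU bitmask (bit 0 is CPU 0, bit 1 is CPU 1, and so on), and because the setting does not survive a reboot it has to be reapplied at every boot, for example from a startup script. The sketch below shows the mask arithmetic and a write-then-read-back helper. The function names are invented for illustration, the helper needs root, and whether a particular IRQ on this board accepts a new mask is not something this sketch can tell you: the write may fail, or the kernel may keep the old value.

```python
# Sketch: build and decode smp_affinity masks, and apply one with a
# read-back check. Function names are illustrative only.

def cpus_to_mask(cpus):
    """[3, 4, 5] -> hex bitmask string as written to smp_affinity."""
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return format(mask, "x")


def mask_to_cpus(mask_hex):
    """Hex bitmask string -> sorted list of CPU indices."""
    # Large systems print comma-separated 32-bit groups; drop the commas.
    mask = int(mask_hex.replace(",", ""), 16)
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def set_irq_affinity(irq, cpus):
    """Write the mask (root required) and return what the kernel reports.

    May raise OSError if the kernel rejects the write. Compare the
    returned list with `cpus` to see whether the change actually stuck.
    """
    path = f"/proc/irq/{irq}/smp_affinity"
    with open(path, "w") as f:
        f.write(cpus_to_mask(cpus))
    with open(path) as f:
        return mask_to_cpus(f.read().strip())


print(cpus_to_mask([0]))        # mask selecting CPU 0 only
print(cpus_to_mask([3, 4, 5]))  # mask selecting CPUs 3, 4 and 5
print(mask_to_cpus("3f"))       # all six cores of a TX2
```

A call such as set_irq_affinity(381, [3]) would be the hypothetical way to try moving the PCIE line from the earlier output to CPU 3. If the returned list still shows CPU 0, that IRQ is one of those which cannot be rerouted.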
