Unbalanced CPU usage, CPU stalls and reboots

JetPack 4.3
DeepStream 4

I am experiencing unbalanced CPU usage: 90% or more on CPU0, but only 7-15% on the other CPUs.

This is sometimes causing CPU stalls on CPU0 and reboots of the unit

What problem does high CPU0 usage point to?
Are there other logs or tests I can look at?

thanks

RAM 6570/31919MB (lfb 5621x4MB) SWAP 0/15959MB (cached 0MB) CPU [100%@2265,20%@2265,57%@2265,8%@2265,9%@2265,11%@2265,11%@2265,7%@2265] EMC_FREQ 41%@2133 GR3D_FREQ 99%@1377 NVDEC 1190 NVDEC1 1190 APE 150 MTS fg 2% bg 0% AO@37C GPU@43.5C Tdiode@40.75C PMIC@100C AUX@35C CPU@40C thermal@39.1C Tboard@35C GPU 18424/12692 CPU 3224/762 SOC 7062/5989 CV 0/0 VDDRQ 2456/1815 SYS5V 3588/3144
RAM 6571/31919MB (lfb 5621x4MB) SWAP 0/15959MB (cached 0MB) CPU [100%@2265,20%@2265,12%@2265,12%@2265,9%@2265,9%@2265,13%@1278,12%@2265] EMC_FREQ 45%@2133 GR3D_FREQ 99%@1377 NVDEC 1190 NVDEC1 1190 APE 150 MTS fg 0% bg 1% AO@37C GPU@43.5C Tdiode@41C PMIC@100C AUX@35.5C CPU@39.5C thermal@39.1C Tboard@35C GPU 18278/12692 CPU 1843/762 SOC 7068/5989 CV 0/0 VDDRQ 2458/1815 SYS5V 3588/3144
RAM 6571/31919MB (lfb 5621x4MB) SWAP 0/15959MB (cached 0MB) CPU [100%@2265,15%@2265,11%@2265,33%@2265,18%@2265,10%@2265,9%@2265,9%@2265] EMC_FREQ 46%@2133 GR3D_FREQ 78%@1377 NVDEC 1190 NVDEC1 1190 APE 150 MTS fg 1% bg 0% AO@37C GPU@44C Tdiode@41.5C PMIC@100C AUX@35.5C CPU@40C thermal@39.1C Tboard@35C GPU 18424/12692 CPU 2763/762 SOC 7216/5989 CV 0/0 VDDRQ 2457/1815 SYS5V 3582/3144
RAM 6571/31919MB (lfb 5621x4MB) SWAP 0/15959MB (cached 0MB) CPU [100%@2265,18%@2265,15%@2265,6%@2265,39%@2265,9%@2265,7%@2265,8%@2265] EMC_FREQ 47%@2133 GR3D_FREQ 99%@1377 NVDEC 1190 NVDEC1 1190 APE 150 MTS fg 0% bg 0% AO@37.5C GPU@44.5C Tdiode@41.75C PMIC@100C AUX@36C CPU@40C thermal@39.75C Tboard@35C GPU 18577/12692 CPU 2304/762 SOC 7065/5989 CV 0/0 VDDRQ 2457/1815 SYS5V 3588/3144
RAM 6571/31919MB (lfb 5621x4MB) SWAP 0/15959MB (cached 0MB) CPU [100%@2265,21%@2265,20%@2265,12%@2265,14%@2265,9%@2265,15%@2265,6%@2265] EMC_FREQ 47%@2133 GR3D_FREQ 99%@1377 NVDEC 1190 NVDEC1 1190 APE 150 MTS fg 0% bg 0% AO@38C GPU@44.5C Tdiode@41.75C PMIC@100C AUX@36C CPU@40C thermal@39.45C Tboard@35C GPU 18585/12692 CPU 1689/762 SOC 7068/5989 CV 0/0 VDDRQ 2458/1815 SYS5V 3588/314
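
For anyone comparing many tegrastats lines like the ones above, a small parser can make the imbalance easier to see. This is only a sketch: the function names are made up for illustration, and it assumes the per-core field always looks like `CPU [load%@freq,...]`, with `off` for cores that are disabled.

```python
import re

# Matches only the bracketed per-core field, e.g. "CPU [100%@2265,20%@2265]".
# The later "CPU@40C" and "CPU 3224/762" fields do not match this pattern.
CPU_FIELD = re.compile(r"CPU \[([^\]]*)\]")


def parse_cpu_loads(line):
    """Return one (load_percent, freq_mhz) tuple per core.

    Cores reported as "off" are returned as (None, None).
    """
    match = CPU_FIELD.search(line)
    if match is None:
        return []
    cores = []
    for entry in match.group(1).split(","):
        entry = entry.strip()
        if entry == "off":
            cores.append((None, None))
            continue
        load, _, freq = entry.partition("%@")
        cores.append((int(load), int(freq)))
    return cores


def busiest_core(line):
    """Return (core_index, load_percent) for the most loaded online core."""
    loads = [
        (index, load)
        for index, (load, _) in enumerate(parse_cpu_loads(line))
        if load is not None
    ]
    return max(loads, key=lambda item: item[1]) if loads else None


if __name__ == "__main__":
    sample = "RAM 6570/31919MB CPU [100%@2265,20%@2265,57%@2265,8%@2265] EMC_FREQ 41%@2133"
    print(parse_cpu_loads(sample))
    print(busiest_core(sample))
```

Feeding each captured line through `busiest_core` shows whether the same core index is always the one that is saturated.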

cpu_stall_reboot.txt (5.1 KB)
2_cpu_stall.txt (7.0 KB)
cpu-stall.txt (5.8 KB)
syslog_cpu_stalls.txt (356.4 KB)
syslog_cpustall2.txt (4.7 MB)

CPU0 is the core that handles hardware interrupts, so there are probably one or more devices generating a large number of interrupts.
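
One way to check this is to total the counts in each CPU column of /proc/interrupts. The following is a minimal sketch, not a definitive tool: `per_cpu_totals` is a hypothetical helper name, and it assumes the usual layout of a header row of CPU names followed by `label: count count ... description` rows.

```python
def per_cpu_totals(text):
    """Sum the interrupt counts in each CPU column of /proc/interrupts text."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        _, _, rest = line.partition(":")
        counts = rest.split()[: len(cpus)]
        # Skip summary rows (such as "Err:") that do not carry one count per CPU.
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for cpu, count in zip(cpus, counts):
            totals[cpu] += int(count)
    return totals


def read_interrupts(path="/proc/interrupts"):
    """Read the live interrupt table (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On the device, `per_cpu_totals(read_interrupts())` gives one total per online core; a CPU0 total that dwarfs the others is expected to some degree, so the per-device rows matter more than the grand total.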

Thanks for quick reply

I’ve no experience of looking at interrupts. There seem to be quite a few things on CPU0; is this normal or not?

interrupts.txt (21.9 KB)

I only have:
Xavier
network switch
2 x camera encoders

is it just a case of replacing bits until it stops happening?

Thank you

Just had another reboot

Managed to take this photo quickly before it happened

The only thing running on the Xavier was tegrastats.

Deepstream was not running and the encoders were not being accessed by the Xavier.
They were connected to the network still but idle.

Thanks

Seems weird…

I’m not sure what your ‘2 x camera encoders’ are, but if they are external devices you could try powering off, removing them, and rebooting.
Devices like these might need some device tree modification and/or driver(s).

It would also help to know:

  • what your carrier board is: the devkit or something else?
  • whether you did any customization (kernel, DT, extlinux.conf,…) and, if so, whether you have done an OTA upgrade since then.

If you are able to boot, use the following command:

dmesg > dmesg.log

in order to make a full log of kernel messages and upload it here.
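
Once the log is saved, it can be searched for the lines most likely to relate to a stall. This is a rough sketch under stated assumptions: the keyword list is a guess at what stall-related kernel messages typically contain, not something taken from the attached logs, and `find_suspect_lines` is a hypothetical helper name.

```python
import re

# Assumed keywords; adjust them to whatever the real log actually contains.
SUSPECT = re.compile(r"stall|soft lockup|hung_task|watchdog", re.IGNORECASE)


def find_suspect_lines(log_text, pattern=SUSPECT):
    """Return (line_number, line) pairs whose text matches the pattern."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if pattern.search(line)
    ]


def scan_file(path, pattern=SUSPECT):
    """Read a saved log (for example dmesg.log) and return the matching lines."""
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle.read(), pattern)
```

Calling `scan_file("dmesg.log")` returns the matching lines with their line numbers, which makes it easier to quote just the relevant part of a large log.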

That copy of “/proc/interrupts” seems to be “not quite right”. Could you post another copy of this without the “truncate” error?

FYI, it isn’t saying a lot, but the screenshot error seems to indicate a block device error combined with some driver not responding (probably because it needed the block device and the block device is having problems). Makes me kind of curious about what workload the block device is under. Does this have more mass storage than just the eMMC? Or is it purely eMMC? And if this is indeed a custom carrier board, then there might be firmware changes needed for eMMC or other block devices.

I have the AGX dev kit, no custom carrier board or extra peripherals

flashed with sdk manager

I’m using 2 x P7304 axis encoders connected via a Netgear network switch

Since posting this I investigated “bluetooth hostwake”

The dev kit doesn’t have bluetooth, so I added bluedroid_pm to the blacklist, and very quick initial tests suggest that this reduced the CPU usage.
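
To confirm that a blacklist entry took effect after a reboot, the module list can be checked. The thread does not say how the blacklist was set up; a common approach is a `blacklist bluedroid_pm` line in a file under /etc/modprobe.d/. The sketch below only verifies the result, and its function names are illustrative.

```python
def loaded_modules(modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    return {line.split()[0] for line in modules_text.splitlines() if line.strip()}


def is_module_loaded(name, modules_text):
    """Report whether the named module appears in the given module list."""
    # The kernel reports module names with underscores, so normalise dashes.
    return name.replace("-", "_") in loaded_modules(modules_text)


def read_proc_modules(path="/proc/modules"):
    """Read the live module list (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On the device, `is_module_loaded("bluedroid_pm", read_proc_modules())` should return `False` once the blacklist is working.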

I’m not sure what you mean about a copy with the “truncate” error?

I only know to “cat /proc/interrupts”

The actual file was out of order and mentions a truncate error. You might try a literal cp of the file to somewhere else, or “cat /proc/interrupts > /some/where/else/interrupts.txt”. It is unlikely to show any error though.
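
A single copy of /proc/interrupts only shows totals since boot. Saving two copies a few seconds apart and comparing them shows which sources are firing right now. This is a minimal sketch with hypothetical function names, assuming both copies were taken with the same set of cores online.

```python
def parse_interrupts(text):
    """Map each IRQ label to (per_cpu_counts, description)."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # number of CPU columns in the header row
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        table[label.strip()] = (counts, " ".join(fields[len(counts):]))
    return table


def fastest_growing(before, after, top=5):
    """Return (increase, label, description) for the busiest interrupt sources."""
    old, new = parse_interrupts(before), parse_interrupts(after)
    growth = []
    for label, (counts, description) in new.items():
        previous = sum(old.get(label, ([], ""))[0])
        growth.append((sum(counts) - previous, label, description))
    growth.sort(reverse=True)
    return growth[:top]
```

Passing the contents of the two saved files to `fastest_growing` lists the sources with the largest increase first; dividing each increase by the time between the copies gives an approximate interrupt rate.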

FYI, not all interrupts can be handled on cores other than CPU0. Even when a driver is called by an interrupt it is unlikely to switch to a new core in most cases due to the scheduler trying to avoid a cache miss. Switching cores can reduce performance by a lot.
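
To see which cores a given interrupt is currently allowed to run on, Linux exposes a hexadecimal bitmask in /proc/irq/<N>/smp_affinity. The sketch below decodes that mask; the function names are illustrative, and whether a particular interrupt can actually be moved off CPU0 depends on the hardware and driver, as noted above.

```python
def cpus_from_affinity_mask(mask_text):
    """Decode a hex CPU bitmask such as "ff" or "00000000,00000005"."""
    # Wide masks are written as comma-separated 32-bit groups.
    value = int(mask_text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def irq_affinity(irq, root="/proc/irq"):
    """Return the CPUs a given IRQ number may be delivered to (Linux only)."""
    with open(f"{root}/{irq}/smp_affinity") as handle:
        return cpus_from_affinity_mask(handle.read())
```

A mask of `ff` means all eight cores are permitted, while `01` restricts the interrupt to CPU0. A permissive mask does not guarantee the interrupt is actually spread across cores.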

thanks,

I tried a few reboots to no avail. I pull RTSP streams from the encoders into my Deepstream app. I don’t think they have/need any drivers.

I have AGX dev kits, no custom carrier boards or peripherals.

I’ll post a dmesg log when I am back at work in the morning

thank you

Sorry didn’t notice the error when I posted it, I can see it now thanks.

I’ll have another go, and cp it tomorrow when I am back in work

Removing the bluetooth seems to have helped, but I have no idea why it was trying to work when there is no bluetooth on the dev kit?

Hi,

I’m just using a completely standard Xavier agx dev kit.
Rebooting the encoders does not change anything

I saved my dmesg log; hopefully it shows something.

Thank you for taking a look

dmesg.txt (81.2 KB)

Hi,

A new interrupts.txt

Hopefully this shows something,

Many thanks for taking a look

interrupts.txt (20.6 KB)

What is connected to the AGX? Especially, is anything PCIe connected? Not sure if it has any bearing or not, but is probably important if anything less common is attached.

Looks like you have several USB devices. Mice and keyboards don’t really consume any significant power, but if you have something consuming more power, then you’ll want to test with those running from an externally powered USB HUB (versus drawing power from the Jetson).

I see no errors, nor anything unusual, with the “/proc/interrupts”, aside from the fact that nothing is running other than the first core. The timers are more or less part of each core, and are unrelated to any applications you run. So in a sense, although the interrupts “sort of” don’t seem unusual, it does look like either (A) the system never got to the stage of enabling other cores, or (B) the nvpmodel was set to not run on those cores.

If you can get to the point of running the system, and you run “htop” ("sudo apt-get install htop"), do you see any use of any CPU core other than CPU0? Also, if you can get to a point where you can enter the command “sudo nvpmodel -m 0”, does that take effect and does it change what processors run?
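
Besides htop, the kernel reports which cores are online in /sys/devices/system/cpu/online, using a range list such as `0-7` or `0-3,6`. The sketch below parses that format; the function names are illustrative, and it assumes the standard Linux sysfs path.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU range list such as "0-3,6" into [0, 1, 2, 3, 6]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        # A single number such as "6" has no range end, so reuse the start.
        cpus.extend(range(int(first), int(last or first) + 1))
    return cpus


def online_cpus(path="/sys/devices/system/cpu/online"):
    """Return the list of cores the kernel currently has online (Linux only)."""
    with open(path) as handle:
        return parse_cpu_list(handle.read())
```

Running `online_cpus()` before and after `sudo nvpmodel -m 0` shows whether the power mode changed the set of online cores.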

Hi,

The only things connected to the Xavier are:

Netgear network switch

KVM with keyboard and mouse - the KVM is powered so shouldn’t be drawing any power.
KVM is connected via USB - no other usb devices connected at all.
Could switching the KVM cause issues?

Plugged into the network switch there are 2 axis encoders

Nothing else connected to the Xavier or the network switch.

I’m running in MAXN mode, but the system was possibly idle when I made the interrupts file.
Other cores do kick in when running my deepstream app

Can the KVM be causing any issues?

Thanks

I am experiencing freezes and reboots with my AGX Xavier DevKit as well when running CPU- and GPU-intensive applications in MAXN mode. In my case the same application runs fine when I use sudo nvpmodel -m 3 (“30W all”).
Can you please try “30W all” mode and report back if your issue disappears?

The reboots did still happen on 30W ALL but seem to have stopped now I removed bluedroid_pm.ko

It might be worth trying the same to see if that helps you run with MAXN?

I’m still investigating the other interrupts, as they don’t look right. My Deepstream app also still crashes randomly and returns me to the desktop, so I’m searching for any resolutions.

thanks


Thanks for running the test and for the suggestion to remove bluedroid_pm.
Unfortunately this did not solve the freeze/reboot issue in my scenario.

I suppose the KVM switching could cause issues, but it seems unlikely to cause this particular issue. The network connection shouldn’t be an issue either. You might try without the KVM, but I have no idea if that might be related to a stall or reboot. It is interesting that @dkreutz has a similar experience, but I don’t know what to suggest beyond this point other than trying without the KVM. The fact that other cores do kick in under the right circumstances means the scheduler is doing its job correctly, and not having a lot of interrupts spread out on other cores is probably just the typical way things work when there isn’t enough of a load to use those other cores (the scheduler will try to keep many of the processes on the core they start with).