Unbalanced CPU usage, CPU stalls and reboots

cdevd · May 26, 2021, 10:55am

JetPack 4.3
DeepStream 4

I am experiencing unbalanced CPU usage: 90% on the first core (CPU0) but only 7-15% on the other CPUs.

This is sometimes causing CPU stalls on CPU0 and reboots of the unit

What problem does high CPU0 usage point to?
Are there other logs or tests I can look at?

thanks

RAM 6570/31919MB (lfb 5621x4MB) SWAP 0/15959MB (cached 0MB) CPU [100%@2265,20%@2265,57%@2265,8%@2265,9%@2265,11%@2265,11%@2265,7%@2265] EMC_FREQ 41%@2133 GR3D_FREQ 99%@1377 NVDEC 1190 NVDEC1 1190 APE 150 MTS fg 2% bg 0% AO@37C GPU@43.5C Tdiode@40.75C PMIC@100C AUX@35C CPU@40C thermal@39.1C Tboard@35C GPU 18424/12692 CPU 3224/762 SOC 7062/5989 CV 0/0 VDDRQ 2456/1815 SYS5V 3588/3144
RAM 6571/31919MB (lfb 5621x4MB) SWAP 0/15959MB (cached 0MB) CPU [100%@2265,20%@2265,12%@2265,12%@2265,9%@2265,9%@2265,13%@1278,12%@2265] EMC_FREQ 45%@2133 GR3D_FREQ 99%@1377 NVDEC 1190 NVDEC1 1190 APE 150 MTS fg 0% bg 1% AO@37C GPU@43.5C Tdiode@41C PMIC@100C AUX@35.5C CPU@39.5C thermal@39.1C Tboard@35C GPU 18278/12692 CPU 1843/762 SOC 7068/5989 CV 0/0 VDDRQ 2458/1815 SYS5V 3588/3144
RAM 6571/31919MB (lfb 5621x4MB) SWAP 0/15959MB (cached 0MB) CPU [100%@2265,15%@2265,11%@2265,33%@2265,18%@2265,10%@2265,9%@2265,9%@2265] EMC_FREQ 46%@2133 GR3D_FREQ 78%@1377 NVDEC 1190 NVDEC1 1190 APE 150 MTS fg 1% bg 0% AO@37C GPU@44C Tdiode@41.5C PMIC@100C AUX@35.5C CPU@40C thermal@39.1C Tboard@35C GPU 18424/12692 CPU 2763/762 SOC 7216/5989 CV 0/0 VDDRQ 2457/1815 SYS5V 3582/3144
RAM 6571/31919MB (lfb 5621x4MB) SWAP 0/15959MB (cached 0MB) CPU [100%@2265,18%@2265,15%@2265,6%@2265,39%@2265,9%@2265,7%@2265,8%@2265] EMC_FREQ 47%@2133 GR3D_FREQ 99%@1377 NVDEC 1190 NVDEC1 1190 APE 150 MTS fg 0% bg 0% AO@37.5C GPU@44.5C Tdiode@41.75C PMIC@100C AUX@36C CPU@40C thermal@39.75C Tboard@35C GPU 18577/12692 CPU 2304/762 SOC 7065/5989 CV 0/0 VDDRQ 2457/1815 SYS5V 3588/3144
RAM 6571/31919MB (lfb 5621x4MB) SWAP 0/15959MB (cached 0MB) CPU [100%@2265,21%@2265,20%@2265,12%@2265,14%@2265,9%@2265,15%@2265,6%@2265] EMC_FREQ 47%@2133 GR3D_FREQ 99%@1377 NVDEC 1190 NVDEC1 1190 APE 150 MTS fg 0% bg 0% AO@38C GPU@44.5C Tdiode@41.75C PMIC@100C AUX@36C CPU@40C thermal@39.45C Tboard@35C GPU 18585/12692 CPU 1689/762 SOC 7068/5989 CV 0/0 VDDRQ 2458/1815 SYS5V 3588/314
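
The imbalance in these samples can be pulled out programmatically. Below is a rough Python sketch, assuming the `CPU [load%@freq,...]` field layout seen in the lines above and assuming that tegrastats prints `off` for a disabled core. The function names are hypothetical.

```python
import re


def parse_cpu_loads(line):
    """Return one (load_percent, freq_mhz) tuple per core, or None for an 'off' core."""
    # Match only the bracketed field; "CPU@40C" and "CPU 3224/762" do not match.
    match = re.search(r"CPU \[([^\]]*)\]", line)
    if match is None:
        return []
    cores = []
    for entry in match.group(1).split(","):
        entry = entry.strip()
        if entry == "off":  # assumption: disabled cores are reported as "off"
            cores.append(None)
            continue
        load, freq = entry.split("%@")
        cores.append((int(load), int(freq)))
    return cores


def busiest_core(line):
    """Return (core_index, load_percent) for the most loaded online core."""
    loads = [(core[0], idx)
             for idx, core in enumerate(parse_cpu_loads(line))
             if core is not None]
    if not loads:
        return None
    load, idx = max(loads, key=lambda pair: pair[0])
    return idx, load
```

Running `busiest_core` over every line of a saved tegrastats log would show whether the same core is pinned in every sample, as it appears to be here.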

cpu_stall_reboot.txt (5.1 KB)
2_cpu_stall.txt (7.0 KB)
cpu-stall.txt (5.8 KB)
syslog_cpu_stalls.txt (356.4 KB)
syslog_cpustall2.txt (4.7 MB)

Honey_Patouceul · May 26, 2021, 11:04am

CPU0 is the core that handles hw interrupts, so probably there is one or several device(s) generating many ISRs.
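
One way to see which devices are generating the interrupts on CPU0 is to rank the rows of /proc/interrupts by their CPU0 count. This is a minimal Python sketch, assuming the usual layout (a header row of CPU names, then one row per interrupt source). The function names are illustrative only.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into (label, per_cpu_counts, description) rows."""
    lines = text.strip("\n").splitlines()
    if not lines:
        return []
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    rows = []
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        if not counts:
            continue
        description = " ".join(fields[len(counts):])
        rows.append((label.strip(), counts, description))
    return rows


def top_on_cpu(rows, cpu=0, limit=5):
    """Return the rows with the highest counts on the given CPU, largest first."""
    candidates = [row for row in rows if len(row[1]) > cpu]
    return sorted(candidates, key=lambda row: row[1][cpu], reverse=True)[:limit]


if __name__ == "__main__":
    import os
    if os.path.exists("/proc/interrupts"):
        with open("/proc/interrupts") as handle:
            for label, counts, description in top_on_cpu(parse_interrupts(handle.read())):
                print(label, counts[0], description)
```

The counts are cumulative since boot, so comparing two captures taken a few seconds apart gives a better picture of what is firing right now.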

cdevd · May 26, 2021, 11:23am

Thanks for quick reply

I’ve no experience of looking at interrupts. There seem to be quite a few things on CPU0 - is this normal or not?

interrupts.txt (21.9 KB)

I only have:
Xavier
network switch
2 x camera encoders

is it just a case of replacing bits until it stops happening?

Thank you

cdevd · May 26, 2021, 12:10pm

Just had another reboot

Managed to take this photo quickly before it happened

The only thing running on the Xavier was tegrastats.

Deepstream was not running and the encoders were not being accessed by the Xavier.
They were connected to the network still but idle.

Thanks

Honey_Patouceul · May 26, 2021, 7:50pm

Seems weird…

Not sure what your ‘2 x camera encoders’ are, but if they are external devices you could try powering off, removing them, and rebooting.
These might need some device tree modification and/or driver(s).

It would also help to know:

What is your carrier board? The devkit or something else?
Did you do any customization (kernel, DT, extlinux.conf, …), and if so, have you done an OTA upgrade since then?

If you are able to boot, use the following command:

dmesg > dmesg.log

in order to make a full log of kernel messages and upload it here.
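
Once `dmesg.log` exists, the lines most likely to matter can be filtered out before uploading or reading it. A small Python sketch follows. The keyword list is an assumption about what stall-related kernel messages typically contain, not an exhaustive or official set.

```python
import os
import re

# Assumed keywords; adjust to whatever actually appears in your log.
SUSPECT_PATTERN = re.compile(
    r"stall|hung task|watchdog|blocked for more than", re.IGNORECASE
)


def find_suspect_lines(log_text):
    """Return the log lines that match any of the assumed stall-related keywords."""
    return [line for line in log_text.splitlines() if SUSPECT_PATTERN.search(line)]


if __name__ == "__main__":
    path = "dmesg.log"  # file produced by: dmesg > dmesg.log
    if os.path.exists(path):
        with open(path, errors="replace") as handle:
            for line in find_suspect_lines(handle.read()):
                print(line)
```

The full log is still worth uploading, since the lines just before a stall report often say more than the report itself.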

linuxdev · May 26, 2021, 8:36pm

That copy of “/proc/interrupts” seems to be “not quite right”. Could you post another copy of this without the “truncate” error?

FYI, it isn’t saying a lot, but the screenshot error seems to indicate a block device error combined with some driver not responding (probably because it needed the block device and the block device is having problems). Makes me kind of curious about what workload the block device is under. Does this have more mass storage than just the eMMC? Or is it purely eMMC? And if this is indeed a custom carrier board, then there might be firmware changes needed for eMMC or other block devices.

cdevd · May 26, 2021, 9:06pm

I have the AGX dev kit, no custom carrier board or extra peripherals

flashed with sdk manager

I’m using 2 x P7304 axis encoders connected via a Netgear network switch

Since posting this I investigated “bluetooth hostwake”

The dev kit doesn’t have bluetooth, so I added bluedroid_pm to the blacklist, and very quick initial tests suggest that this reduced the CPU usage.
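
To confirm that the blacklist took effect after a reboot, the module list can be checked. This is a minimal Python sketch, assuming bluedroid_pm is built as a loadable module (so it would appear in /proc/modules when loaded). The helper names are hypothetical.

```python
from pathlib import Path


def loaded_modules(modules_text):
    """Return the set of module names from /proc/modules-style text."""
    # The module name is the first whitespace-separated field on each line.
    return {line.split()[0] for line in modules_text.splitlines() if line.strip()}


def is_loaded(name, modules_text):
    """True if the named module appears in the given module listing."""
    return name in loaded_modules(modules_text)


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        loaded = is_loaded("bluedroid_pm", proc_modules.read_text())
        print("bluedroid_pm is", "loaded" if loaded else "not loaded")
```

A driver compiled into the kernel would not show up in this list, so "not loaded" only rules out the module case.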

I’m not sure what you mean about a copy with the “truncate” error?

I only know to “cat /proc/interrupts”

linuxdev · May 26, 2021, 9:09pm

The actual file was out of order and mentions a truncate error. You might try a literal cp of the file to somewhere else, or “cat /proc/interrupts > /some/where/else/interrupts.txt”. It is unlikely to show any error though.

FYI, not all interrupts can be handled on cores other than CPU0. Even when a driver is called by an interrupt it is unlikely to switch to a new core in most cases due to the scheduler trying to avoid a cache miss. Switching cores can reduce performance by a lot.
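
Which cores a given interrupt is allowed to run on can be read from `/proc/irq/<N>/smp_affinity_list`. The Python sketch below only reads that file and does not try to change it. The `proc_root` parameter and the function names are illustrative, and whether a particular IRQ can actually be moved depends on the hardware and driver.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a CPU list such as '0-3,5' into [0, 1, 2, 3, 5]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def irq_affinity(irq, proc_root="/proc"):
    """Return the CPUs an IRQ may run on, or None if the file cannot be read."""
    path = Path(proc_root) / "irq" / str(irq) / "smp_affinity_list"
    try:
        return parse_cpu_list(path.read_text())
    except OSError:
        return None
```

An IRQ whose list contains only `0` is pinned to CPU0, which would be consistent with the point above that some interrupts cannot be handled elsewhere.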

cdevd · May 26, 2021, 9:10pm

thanks,

I tried a few reboots to no avail. I pull RTSP streams from the encoders into my Deepstream app. I don’t think they have/need any drivers.

I have AGX dev kits, no custom carrier boards or peripherals.

I’ll post a dmesg log when I am back at work in the morning

thank you

cdevd · May 26, 2021, 9:15pm

Sorry didn’t notice the error when I posted it, I can see it now thanks.

I’ll have another go, and cp it tomorrow when I am back in work

Removing the bluetooth seems to have helped, but I have no idea why it was trying to work when there is no bluetooth on the dev kit?

cdevd · May 27, 2021, 7:39am

Hi,

I’m just using a completely standard Xavier agx dev kit.
Rebooting the encoders does not change anything

I saved my dmesg log; hopefully it shows something.

Thank you for taking a look

dmesg.txt (81.2 KB)

cdevd · May 27, 2021, 7:41am

Hi,

A new interrupts.txt

Hopefully this shows something,

Many thanks for taking a look

interrupts.txt (20.6 KB)

linuxdev · May 27, 2021, 7:02pm

What is connected to the AGX? Especially, is anything PCIe connected? Not sure if it has any bearing or not, but is probably important if anything less common is attached.

Looks like you have several USB devices. Mice and keyboards don’t really consume any significant power, but if you have something consuming more power, then you’ll want to test with those running from an externally powered USB HUB (versus drawing power from the Jetson).

I see no errors, nor anything unusual, with the “/proc/interrupts”, aside from the fact that nothing is running other than the first core. The timers are more or less part of each core, and are unrelated to any applications you run. So in a sense, although the interrupts “sort of” don’t seem unusual, it does look like either (A) the system never got to the stage of enabling other cores, or (B) the nvpmodel was set to not run on those cores.

If you can get to the point of running the system, and you run “htop” (“sudo apt-get install htop”), do you see any use of any CPU core other than CPU0? Also, if you can get to a point where you can enter the command “sudo nvpmodel -m 0”, does that take effect and does it change what processors run?
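
If installing htop is not convenient, per-core usage can also be estimated from two readings of /proc/stat. This is a rough Python sketch, assuming the standard field order (user, nice, system, idle, iowait, irq, softirq, steal) and counting idle plus iowait as idle time. The function names are hypothetical.

```python
import time
from pathlib import Path


def parse_proc_stat(text):
    """Map 'cpuN' -> (busy_ticks, total_ticks) from /proc/stat-style text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip non-CPU lines and the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        total = sum(values[:8])  # guest time is already included in user time
        stats[fields[0]] = (total - idle, total)
    return stats


def core_usage(before, after):
    """Return per-core busy percentage between two parse_proc_stat() snapshots."""
    usage = {}
    for name, (busy1, total1) in after.items():
        busy0, total0 = before.get(name, (0, 0))
        delta_total = total1 - total0
        usage[name] = 100.0 * (busy1 - busy0) / delta_total if delta_total > 0 else 0.0
    return usage


if __name__ == "__main__":
    stat = Path("/proc/stat")
    if stat.exists():
        first = parse_proc_stat(stat.read_text())
        time.sleep(1.0)
        second = parse_proc_stat(stat.read_text())
        for name, percent in sorted(core_usage(first, second).items()):
            print(f"{name}: {percent:.1f}%")
```

Cores taken offline by the current nvpmodel setting simply do not appear in /proc/stat, so a short list here is itself a hint about which mode is active.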

cdevd · May 28, 2021, 8:00am

Hi,

The only things connected to the Xavier are:

Netgear network switch

KVM with keyboard and mouse - the KVM is powered so shouldn’t be drawing any power.
KVM is connected via USB - no other usb devices connected at all.
Could switching the KVM cause issues?

Plugged into the network switch there are 2 axis encoders

Nothing else connected to the Xavier or the network switch.

I’m running in MAXN mode, but the system was possibly at idle when I made the interrupts file.
Other cores do kick in when running my deepstream app

Can the KVM be causing any issues?

Thanks

dkreutz · May 28, 2021, 9:50am

I am experiencing freezes and reboots with my AGX Xavier DevKit as well when running CPU- and GPU-intensive applications in MAXN mode. In my case the same application runs fine when I use sudo nvpmodel -m 3 (“30W all”).
Can you please try “30W all” mode and report back if your issue disappears?

cdevd · May 28, 2021, 11:07am

The reboots did still happen on 30W ALL but seem to have stopped now I removed bluedroid_pm.ko

It might be worth trying the same to see if that helps you run with MAXN?

I’m still trying to investigate the other interrupts, as they don’t look right. My Deepstream app also still randomly crashes and returns me to the desktop, so I’m searching for any resolutions.

thanks

dkreutz · May 28, 2021, 11:37am

Thanks for performing the test and for the suggestion to remove bluedroid_pm.
Unfortunately this did not solve the freeze/reboot issue in my scenario.

linuxdev · May 28, 2021, 6:17pm

I suppose the KVM switching could cause issues, but it seems unlikely to cause this particular issue. Network connection shouldn’t be an issue either. You might try without the KVM, but I have no idea if that might be related to a stall or reboot. It is interesting though that @dkreutz has a similar experience, but don’t know what to suggest beyond this point other than trying without KVM. The fact that other cores do kick in under the right circumstances means the scheduler is doing its job correctly, and not having a lot of interrupts spread out on other cores is probably just the typical way things work when there isn’t enough of a load to use those other cores (the scheduler will try to keep many of the processes on the core they start with).
