Jetson TX2: kworker CPU usage at 100%

Hello everybody,

I’ve been struggling with this problem for a while now, and I couldn’t find a working solution.

The kworker process constantly uses 100% of a CPU and blocks everything; I can’t even shut the Jetson down.

I know it’s a known problem in the community, but all the solutions I found (this one, for example: https://askubuntu.com/questions/176565/why-does-kworker-cpu-usage-get-so-high) seem to go through the ACPI interrupt manager, which is (apparently) not present on my board.

In my current setup, the board is connected to a CAN bus and to a LiDAR that talks over Ethernet; both use interrupts (as far as I know).

Thanks in advance for any help!

Hi,

The % utilization shown is wrong, though.
Can you tell me when the utilization shoots up? Right after boot, while the system is idling? When the CAN bus is busy, or when the LiDAR is talking over Ethernet? Is the issue seen with Ethernet disconnected?
I am just trying to locate where the problem is. Locally we have not seen it.

Can you also check on your system what this kworker is doing, for example using ftrace?
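
One possible way to do that is to turn on the workqueue trace events, which log the work function each kworker starts and finishes. This is only a rough sketch: it assumes tracefs is mounted at /sys/kernel/debug/tracing, that the kernel was built with the workqueue events, and that you run it as root. The TRACING variable and the trace_workqueues name are made up for illustration.

```shell
# Assumed tracefs location; override TRACING if it is mounted elsewhere.
TRACING="${TRACING:-/sys/kernel/debug/tracing}"

# Hypothetical helper: enable the workqueue start/end events and tracing.
trace_workqueues() {
    # Logged when a kworker begins running a work item (includes the function).
    echo 1 > "$TRACING/events/workqueue/workqueue_execute_start/enable"
    # Logged when that work item finishes.
    echo 1 > "$TRACING/events/workqueue/workqueue_execute_end/enable"
    # Make sure the ring buffer is recording.
    echo 1 > "$TRACING/tracing_on"
}

# Usage (as root):
#   trace_workqueues
#   cat "$TRACING/trace_pipe" > out.txt
```

A work function that keeps showing up for the busy kworker is a good hint about which driver keeps queuing work.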

thanks
Bibek

Thanks for your answer.
Unfortunately, I don’t have access to the Jetson at the moment; as soon as I can ftrace the problem, I’ll post the result here.

The problem usually appears when both CAN and Ethernet are connected and talking. Once it appears, there’s no way of stopping it, and the usage stays at that percentage even when idling.

Please note that I’ve observed this problem with different LiDARs and CAN transceivers, so I tend to rule out a hardware/software problem on that side.

I did try this solution (https://www.linuxquestions.org/questions/linux-software-2/high-cpu-usage-by-kworker-4175563563/): the problem took longer to appear, but it appeared anyway.

Let me know if there’s something more that could help you!

Thanks!

Hi,

Can you dump the backtraces of all tasks using SysRq?
https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html
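
For reference, a minimal sketch of triggering that dump from a shell rather than the keyboard. It assumes root access and a kernel built with magic SysRq support; the variable names and the dump_all_tasks helper are only illustrative.

```shell
# Standard procfs paths; kept in variables so they can be overridden.
SYSRQ_ENABLE="${SYSRQ_ENABLE:-/proc/sys/kernel/sysrq}"
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

# Hypothetical helper: ask the kernel to dump every task's state and stack.
dump_all_tasks() {
    # 1 enables all SysRq functions.
    echo 1 > "$SYSRQ_ENABLE"
    # 't' dumps the current tasks and their information to the kernel log.
    echo t > "$SYSRQ_TRIGGER"
}

# Usage (as root):
#   dump_all_tasks
#   dmesg > tasks.txt
```

The dump goes to the kernel log, so it can be long; saving the dmesg output to a file makes it easier to attach.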

Hi,

I finally managed to reproduce the error and used ftrace to log what I could.

The file out.txt corresponds to the output of:

$ cat /sys/kernel/debug/tracing/trace_pipe > out.txt

while out2.txt is the output of:

$ cat /proc/THE_OFFENDING_KWORKER/stack

with THE_OFFENDING_KWORKER being the PID of the kworker as shown in htop.
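
In case it helps anyone reproduce this, here is a rough sketch of picking that PID without htop. It assumes the procps version of ps; busiest_kworker is a made-up helper name. Note that ps reports %CPU averaged over the thread's lifetime, so the figure can differ from what htop shows.

```shell
# Hypothetical helper: read "PID %CPU COMMAND" lines on stdin and print
# the PID of the kworker thread with the highest %CPU.
busiest_kworker() {
    awk 'BEGIN { max = -1 }
         # Only consider kernel worker threads; the header line is skipped
         # because its third column is not a kworker name.
         $3 ~ /^kworker/ && ($2 + 0) > max { max = $2 + 0; pid = $1 }
         END { if (pid != "") print pid }'
}

# Usage:
#   pid="$(ps -eo pid,pcpu,comm | busiest_kworker)"
#   cat "/proc/$pid/stack"     # needs root
```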

Thanks in advance!
out2.txt (153 Bytes)
out.txt (3.44 MB)

out.txt:
It’s showing two things:

  1. There is a display-related SMMU error. An address outside the display’s mapped region is being accessed, which throws these errors. But I don’t think those are what you are worried about.
    One thing to note: CPU0 is only spewing this error and is not doing any workqueue job. If CPU0 is stuck, this could be the reason.

       <idle>-0     [000] d.h1     3.156581: arm_smmu_context_fault: Unhandled context fault: iova=0x96d82e40, fsynr=0x1, cb=19, sid=9(0x9 - NVDISPLAY), pgd=0 pud=0, pmd=0, pte=0
       <idle>-0     [000] d.h1     3.156609: arm_smmu_context_fault: Unhandled context fault: iova=0x96d86740, fsynr=0x1, cb=19, sid=9(0x9 - NVDISPLAY), pgd=0 pud=0, pmd=0, pte=0
       <idle>-0     [000] d.h1     3.156644: arm_smmu_context_fault: Unhandled context fault: iova=0x96d8a000, fsynr=0x1, cb=19, sid=9(0x9 - NVDISPLAY), pgd=0 pud=0, pmd=0, pte=0
       <idle>-0     [000] d.h1     3.156671: arm_smmu_context_fault: Unhandled context fault: iova=0x96d8e7c0, fsynr=0x1, cb=19, sid=9(0x9 - NVDISPLAY), pgd=0 pud=0, pmd=0, pte=0
       <idle>-0     [000] d.h1     3.156700: arm_smmu_context_fault: Unhandled context fault: iova=0x96d91e00, fsynr=0x1, cb=19, sid=9(0x9 - NVDISPLAY), pgd=0 pud=0, pmd=0, pte=
    
  2. Can you tell me which process ID was hogging the CPU this time?
    kworker/0:3 is not seen in this log.
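
To check what each CPU is actually logging, the events in out.txt can be tallied per CPU. A rough sketch, assuming the default ftrace text layout seen in the lines above (task-PID, [CPU], flags, timestamp, then the event name); events_per_cpu is a made-up helper name.

```shell
# Hypothetical helper: read ftrace text on stdin and print
# "count [cpu] event" for each CPU/event pair, most frequent first.
events_per_cpu() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^\[[0-9]+\]$/) {       # CPU column, e.g. [000]
                key = $i " " $(i + 3)         # event follows flags and timestamp
                sub(/:$/, "", key)            # drop the trailing colon
                count[key]++
                break
            }
        }
    } END { for (k in count) print count[k], k }' | sort -rn
}

# Usage:
#   events_per_cpu < out.txt | head
```

If [000] shows nothing but arm_smmu_context_fault while the other CPUs show workqueue events, that would support the idea that CPU0 is stuck in the fault handler.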

Hi bbasu.

Thanks for your answer.

Regarding question 1, I agree with you. It seems that the display is causing problems. After reading the ftrace output, we’ve been working without any display attached, and the problem has not appeared since. Our guess is that the display is the cause. Does that seem like a reasonable guess to you? How can we solve it?

Regarding question 2, the PID was 55. I don’t think it was kworker/0:3. What I’ve seen is that the kworker hogging the CPU changes from time to time.

Thank you very much for your time!

Yeah, we should fix the SMMU display issue.
What display panel are you using? Is it over HDMI or over DP?
Are you using a Jetson board or your own custom hardware?
Can you share the boot log?

I am using HDMI with the Jetson.

I attached a txt file with the dmesg output.

I think that what we are looking for is at time [0.244].

Thanks for your help
dmesg.txt.txt (69.5 KB)

Hi Tommaso

Thanks for the log.
Can you boot without HDMI connected and then connect after boot?
This issue was fixed in the latest release. Which release version are you using?

regards
Bibek