Occasionally, the NVIDIA kernel module on a new Threadripper 3960X desktop gets stuck, such that Xorg (if X is running) takes 100% of the CPU. The mouse stays responsive, but keyboard and application-generated events take 10 or more seconds to show results on the screen. Anything else that talks to the kernel module, such as nvidia-bug-report.sh’s probe of GPU information, is also very slow.
If I kill the gdm systemd slice and restart it, everything is still slow. Only a reboot restores normal behavior.
I am not sure what triggers it. Twice, it happened while I was opening new tabs in Google Chrome. Another time, I only had terminals and the Psensor GUI open on my GNOME desktop.
This is a new computer, but nothing is overclocked, and Psensor reports steady temperatures.
What is the best way to find a root cause and/or a workaround for this?
Typical “perf top” output for the Xorg process (with driver 440.44-2) looks like this:
  83.31%  [kernel]  [k] _nv030768rm
   1.08%  [kernel]  [k] _nv020844rm
   1.07%  [kernel]  [k] memset
   1.06%  [kernel]  [k] _raw_spin_lock_irqsave
   0.73%  [kernel]  [k] _raw_spin_unlock_irqrestore
   0.71%  [kernel]  [k] _nv021189rm
   0.57%  [kernel]  [k] _nv025639rm
   0.54%  [kernel]  [k] _nv020842rm
   0.54%  [kernel]  [k] _nv021232rm
   0.52%  [kernel]  [k] _nv021233rm
   0.52%  [kernel]  [k] _nv021027rm
The same issue happened with 430.64-5 last weekend, but the numbers in the obfuscated symbol names were different.
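Since the obfuscated symbol names change between driver versions, comparing runs symbol by symbol does not tell me much. Here is a rough sketch of how the captures could be compared instead: it sums the share of all symbols that look like the driver's obfuscated names and reports everything else separately. The helper name nvidia_share and the pattern _nv<digits>rm are my own assumptions, based only on the output above; none of this is an NVIDIA or perf tool.

```python
import re

# One "perf top" entry: percentage, DSO, [k]/[.] marker, symbol name.
ENTRY = re.compile(r"(\d+(?:\.\d+)?)%\s+(\S+)\s+\[[k.]\]\s+(\S+)")

# Assumed shape of the obfuscated driver symbols, inferred from the
# output above (e.g. _nv030768rm). Other driver versions may differ.
NV_SYMBOL = re.compile(r"_nv\d+rm")


def nvidia_share(perf_text):
    """Return (nvidia_percent, other_percent) summed over all parsed entries."""
    nv = 0.0
    other = 0.0
    for pct, _dso, symbol in ENTRY.findall(perf_text):
        if NV_SYMBOL.fullmatch(symbol):
            nv += float(pct)
        else:
            other += float(pct)
    return round(nv, 2), round(other, 2)


# The 440.44-2 capture from this post, pasted as plain text.
sample = """
83.31% [kernel] [k] _nv030768rm
 1.08% [kernel] [k] _nv020844rm
 1.07% [kernel] [k] memset
 1.06% [kernel] [k] _raw_spin_lock_irqsave
 0.73% [kernel] [k] _raw_spin_unlock_irqrestore
 0.71% [kernel] [k] _nv021189rm
 0.57% [kernel] [k] _nv025639rm
 0.54% [kernel] [k] _nv020842rm
 0.54% [kernel] [k] _nv021232rm
 0.52% [kernel] [k] _nv021233rm
 0.52% [kernel] [k] _nv021027rm
"""

nv_pct, other_pct = nvidia_share(sample)
print(f"NVIDIA symbols: {nv_pct}%  other: {other_pct}%")
```

Saving one capture per driver version and running each through the same function would give a single number per version to compare, regardless of how the symbols are renamed.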