Xorg 100% CPU usage in kernel module on Debian testing with drivers 440.44-2 (RTX 2080 Super)

Occasionally, the kernel module on a new Threadripper 3960X desktop gets stuck, such that Xorg (if X is running) takes 100% of the CPU. The mouse is responsive, but keyboard and application-generated events take many (10+) seconds to show results on the screen. Anything else that talks to the kernel – like nvidia-bug-report.sh’s probe of GPU information – is also very slow.

If I kill the gdm systemd slice, and restart it, it continues to be slow. Only a reboot restores it.

I am not sure what triggers it. Two of the times it happened, I was opening new tabs in Google Chrome. Another time, I only had terminals and the Psensor GUI open on my GNOME desktop.

This is a new computer, but nothing is overclocked, and Psensor reports steady temperatures.

What is the best way to find a root cause and/or a workaround for this?

Typical output for “perf top” for the Xorg process (with drivers 440.44-2) looks like this:

  83.31%  [kernel]              [k] _nv030768rm
   1.08%  [kernel]              [k] _nv020844rm
   1.07%  [kernel]              [k] memset
   1.06%  [kernel]              [k] _raw_spin_lock_irqsave
   0.73%  [kernel]              [k] _raw_spin_unlock_irqrestore
   0.71%  [kernel]              [k] _nv021189rm
   0.57%  [kernel]              [k] _nv025639rm
   0.54%  [kernel]              [k] _nv020842rm
   0.54%  [kernel]              [k] _nv021232rm
   0.52%  [kernel]              [k] _nv021233rm
   0.52%  [kernel]              [k] _nv021027rm
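
For anyone comparing captures across driver versions, here is a minimal sketch that totals the share of samples in the obfuscated symbols from the listing above. It assumes that every symbol of the form `_nv<digits>rm` belongs to the NVIDIA kernel module and that the text is pasted in the same column layout; the names `parse_perf_top` and `nvidia_share` are made up for this example and are not part of perf.

```python
import re

# Sample copied from the "perf top" listing above (driver 440.44-2).
PERF_TOP_SAMPLE = """\
  83.31%  [kernel]              [k] _nv030768rm
   1.08%  [kernel]              [k] _nv020844rm
   1.07%  [kernel]              [k] memset
   1.06%  [kernel]              [k] _raw_spin_lock_irqsave
   0.73%  [kernel]              [k] _raw_spin_unlock_irqrestore
   0.71%  [kernel]              [k] _nv021189rm
   0.57%  [kernel]              [k] _nv025639rm
   0.54%  [kernel]              [k] _nv020842rm
   0.54%  [kernel]              [k] _nv021232rm
   0.52%  [kernel]              [k] _nv021233rm
   0.52%  [kernel]              [k] _nv021027rm
"""

# One row: percentage, object name, symbol-type marker such as [k], symbol name.
LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+\[(\w)\]\s+(\S+)\s*$")

# Assumption: obfuscated NVIDIA symbols look like _nv<digits>rm.
NVIDIA_SYMBOL_RE = re.compile(r"^_nv\d+rm$")


def parse_perf_top(text):
    """Return a list of (percent, symbol) tuples, skipping lines that do not match."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append((float(match.group(1)), match.group(4)))
    return rows


def nvidia_share(rows):
    """Sum the percentages of rows whose symbol matches the assumed NVIDIA pattern."""
    return sum(percent for percent, symbol in rows if NVIDIA_SYMBOL_RE.match(symbol))


rows = parse_perf_top(PERF_TOP_SAMPLE)
print(f"{nvidia_share(rows):.2f}% of listed samples are in _nv*rm symbols")
```

Because the numbers in the symbol names change between driver releases, the sketch matches on the name pattern rather than on specific symbols, so the totals from two driver versions can be compared directly.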

The same issue happened with 430.64-5 last weekend, but the numbers in the obfuscated symbol names were different.

You would hope for some support from NVIDIA. It's a shame nobody answered or helped. I paid £400 for my card and keep getting all these creepy bugs.

Apologies for the delayed response.
@diego_gullo
Please confirm whether you are facing the same issue as described by MichaelP924.
If yes, please provide an NVIDIA bug report captured while the issue is reproducing.
If not, please describe the issue with repro steps so that we can reproduce it locally for debugging.

For what it’s worth, I have not seen this behavior in months. I would guess a newer release of drivers (I am currently using Debian testing’s version of 450.80, previously 450.66) resolved the problem.

Hi @amrits

I read this post while searching the forum because of what I have been facing.

I have an RTX 2070 on Ubuntu 18.04.5 LTS and recently tried to upgrade the driver to nvidia-driver-455. With this driver, the latest, I seem to run into issues when I start using the CUDA/TensorFlow functionality of the card.

Example of running darknet with a video file (mp4 in this case): https://github.com/bizmate/bash-essentials/blob/master/bin/darknet-detect-and-trash.sh#L78

When overloading the CPU with other tasks (let's say 30 browser windows, Java-based code editors, etc.), I have ended up in a position where I cannot recover the X session at all. Even if I switch to a terminal session (on Ubuntu you can switch with Ctrl + Alt + F3), kill all the excess processes, and restart the window manager, I am still stuck on a black screen.

After I commented on this post, I went back and reverted to the nvidia-driver-450-server driver. I have restarted the machine and am now running all the heavy processes as well as the darknet processes, and I am not experiencing any problems.

I have also noticed that when darknet starts processing, I quite often lose internet connectivity for a few moments: the browser gives me an error, and when I reload, it works. This happens all the time, as I keep running darknet to process a lot of video and image data I have on my system.

Of the two problems above, the first was quite frustrating, and it would be great if the second were fixed.
Please let me know if you would like these posted as separate tickets or if you would like more information.